Abstract

This paper proposes a method to construct an adaptive agent that is universal with respect 
to a given class of experts, where each expert is an agent that has been designed specifically 
for a particular environment. This adaptive control problem is formalized as the problem 
of minimizing the relative entropy of the adaptive agent from the expert that is most 
suitable for the unknown environment. If the agent is a passive observer, then the optimal 
solution is the well-known Bayesian predictor. However, if the agent is active, then its past
actions need to be treated as causal interventions on the I/O stream rather than normal
probability conditions. Here it is shown that the solution to this new variational problem is
given by a stochastic controller called the Bayesian control rule, which implements adaptive
behavior as a mixture of experts. Furthermore, it is shown that under mild assumptions, 
the Bayesian control rule converges to the control law of the most suitable expert. 
Keywords: Artificial Intelligence, Minimum Relative Entropy Principle, Bayesian Con- 
trol Rule, Interaction Sequences, Operation Modes. 



1. Introduction 

When the behavior of an environment under any control signal is fully known, then the 
designer can choose an agent¹ that produces the desired dynamics. Instances of this problem
include hitting a target with a cannon under known weather conditions, solving a maze 
having its map and controlling a robotic arm in a manufacturing plant. However, when 
the behavior of the plant is unknown, then the designer faces the problem of adaptive 
control. Examples include shooting the cannon without the appropriate measurement equipment,
finding the way out of an unknown maze and designing an autonomous robot for Martian
exploration. Adaptive control turns out to be far more difficult than its non-adaptive
counterpart. This is because any good policy has to carefully trade off explorative versus 
exploitative actions, i.e. actions for the identification of the environment's dynamics versus 



1. In accordance with the control literature, we use the terms agent and controller interchangeably. Simi- 
larly, the terms environment and plant are used synonymously. 
actions to control it in a desired way. Even when the environment's dynamics are known
to belong to a particular class for which optimal agents are available, constructing the
corresponding optimal adaptive agent is in general computationally intractable even for
simple toy problems (Duff, 2002). Thus, finding tractable approximations has been a major
focus of research.

Recently, it has been proposed to reformulate the problem statement for some classes of 
control problems based on the minimization of a relative entropy criterion. For example, a 
large class of optimal control problems can be solved very efficiently if the problem statement 
is reformulated as the minimization of the deviation of the dynamics of a controlled system
from the uncontrolled system (Todorov, 2006, 2009; Kappen et al., 2009). In this work
a similar approach is introduced. If a class of agents is given, where each agent solves a
different environment, then adaptive controllers can be derived from a minimum relative 
entropy principle. In particular, one can construct an adaptive agent that is universal 
with respect to this class by minimizing the average relative entropy from the environment- 
specific agent. 

However, this extension is not straightforward. There is a syntactical difference between 
actions and observations that has to be taken into account when formulating the variational 
problem. More specifically, actions have to be treated as interventions obeying the rules
of causality (Pearl, 2000; Spirtes et al., 2000; Dawid, 2010). If this distinction is made,
the variational problem has a unique solution given by a stochastic control rule called the 
Bayesian control rule. This control rule is particularly interesting because it translates the 
adaptive control problem into an on-line inference problem that can be applied forward 
in time. Furthermore, this work shows that under mild assumptions, the adaptive agent 
converges to the environment-specific agent. 

The paper is organized as follows. Section 2 introduces notation and sets up the adaptive
control problem. Section 3 formulates adaptive control as a minimum relative entropy
problem. After an initial, naive approach, the need for causal considerations is motivated.
Then, the Bayesian control rule is derived from a revised relative entropy criterion. In
Section 4, the conditions for convergence are examined and a proof is given. Section 5
illustrates the usage of the Bayesian control rule for the multi-armed bandit problem and
the undiscounted Markov decision problem. Section 6 discusses properties of the Bayesian
control rule and relates it to previous work in the literature. Section 7 concludes.



2. Preliminaries 

In the following both agent and environment are formalized as causal models over I/O 
sequences. Agent and environment are coupled to exchange symbols following a standard 
interaction protocol having discrete time, observation and control signals. The treatment
of the dynamics is fully probabilistic, and in particular, both actions and observations are
random variables, which is in contrast to the decision-theoretic agent formulation treating
only observations as random variables (Russell and Norvig, 2003). All proofs are provided
in the appendix. 

Notation. A set is denoted by a calligraphic letter like $\mathcal{A}$. The words set & alphabet
and element & symbol are used to mean the same thing respectively. Strings are finite
concatenations of symbols and sequences are infinite concatenations. $\mathcal{A}^n$ denotes the set of
strings of length $n$ based on $\mathcal{A}$, and $\mathcal{A}^* := \bigcup_{n \geq 0} \mathcal{A}^n$ is the set of finite strings.
Furthermore, $\mathcal{A}^\infty := \{a_1 a_2 \ldots \mid a_i \in \mathcal{A} \text{ for all } i = 1, 2, \ldots\}$ is defined as the set of one-way
infinite sequences based on the alphabet $\mathcal{A}$. Tuples are written with parentheses $(a_1, a_2, a_3)$
or as strings $a_1 a_2 a_3$. For substrings, the following shorthand notation is used: a string that
runs from index $i$ to $k$ is written as $a_{i:k} := a_i a_{i+1} \ldots a_{k-1} a_k$. Similarly, $a_{\leq i} := a_1 a_2 \ldots a_i$ is a
string starting from the first index. Also, symbols are underlined to glue them together
like $\underline{ao}$ in $\underline{ao}_{\leq i} := a_1 o_1 a_2 o_2 \ldots a_i o_i$. The function $\log(x)$ is meant to be taken w.r.t. base 2,
unless indicated otherwise.

Interactions. The possible I/O symbols are drawn from two finite sets. Let $\mathcal{O}$ denote the
set of inputs (observations) and let $\mathcal{A}$ denote the set of outputs (actions). The set
$\mathcal{Z} := \mathcal{A} \times \mathcal{O}$ is the interaction set. A string $\underline{ao}_{\leq t}$ or $\underline{ao}_{<t} a_t$ is an interaction string
(optionally ending in $a_t$ or $o_t$) where $a_k \in \mathcal{A}$ and $o_k \in \mathcal{O}$. Similarly, a one-sided infinite
sequence $a_1 o_1 a_2 o_2 \ldots$ is an interaction sequence. The set of interaction strings of length $t$
is denoted by $\mathcal{Z}^t$. The sets of (finite) interaction strings and sequences are denoted as $\mathcal{Z}^*$
and $\mathcal{Z}^\infty$ respectively. The interaction string of length 0 is denoted by $\epsilon$.

I/O system. Agents and environments are formalized as I/O systems. An I/O system is
a probability distribution $\Pr$ over interaction sequences $\mathcal{Z}^\infty$. $\Pr$ is uniquely determined by
the conditional probabilities

$$\Pr(a_t | \underline{ao}_{<t}), \qquad \Pr(o_t | \underline{ao}_{<t} a_t) \qquad (1)$$

for each $\underline{ao}_{\leq t} \in \mathcal{Z}^*$. However, the semantics of the probability distribution $\Pr$ are only fully
defined once it is coupled to another system.



Figure 1: The model of interactions. The agent P and the environment Q define a
probability distribution over interaction sequences.



Interaction system. Let $P$, $Q$ be two I/O systems. An interaction system $(P, Q)$ is a
coupling of the two systems giving rise to the generative distribution $G$ that describes the
probabilities that actually govern the I/O stream once the two systems are coupled. $G$ is
specified by the equations

$$G(a_t | \underline{ao}_{<t}) := P(a_t | \underline{ao}_{<t})$$
$$G(o_t | \underline{ao}_{<t} a_t) := Q(o_t | \underline{ao}_{<t} a_t)$$

valid for all $\underline{ao}_{\leq t} \in \mathcal{Z}^*$. Here, $G$ models the true probability distribution over interaction
sequences that arises by coupling two systems through their I/O streams. More specifically,
for the system $P$, $P(a_t | \underline{ao}_{<t})$ is the probability of producing action $a_t \in \mathcal{A}$ given history
$\underline{ao}_{<t}$ and $P(o_t | \underline{ao}_{<t} a_t)$ is the predicted probability of the observation $o_t \in \mathcal{O}$ given history
$\underline{ao}_{<t} a_t$. Hence, for $P$, the sequence $o_1 o_2 \ldots$ is its input stream and the sequence $a_1 a_2 \ldots$ is
its output stream. In contrast, the roles of actions and observations are reversed in the case
of the system $Q$. Thus, the sequence $o_1 o_2 \ldots$ is its output stream and the sequence $a_1 a_2 \ldots$
is its input stream. This model of interaction is fairly general, and many other interaction
protocols can be translated into this scheme. As a convention, given an interaction system
$(P, Q)$, $P$ is an agent to be constructed by the designer, and $Q$ is an environment to be
controlled by the agent. Figure 1 illustrates this setup.
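
The coupling of two I/O systems can be made concrete with a small simulation. The following is a minimal sketch, not part of the original formulation: the binary alphabets and the particular conditionals `agent_p` and `env_q` are hypothetical choices made only for illustration. It samples an interaction string from $G$ by alternating draws from the agent's action conditional and the environment's observation conditional, and it evaluates the probability that $G$ assigns to a given string.

```python
import random


def sample(dist, rng):
    """Draw one symbol from a dict mapping symbols to probabilities."""
    symbols = list(dist)
    weights = [dist[s] for s in symbols]
    return rng.choices(symbols, weights=weights, k=1)[0]


def agent_p(history):
    """Hypothetical P(a_t | ao_<t): mostly repeat the last observation."""
    if not history:
        return {0: 0.5, 1: 0.5}
    last_obs = history[-1][1]
    return {last_obs: 0.9, 1 - last_obs: 0.1}


def env_q(history, action):
    """Hypothetical Q(o_t | ao_<t a_t): echo the action with 20% noise."""
    return {action: 0.8, 1 - action: 0.2}


def interact(agent, env, steps, seed=0):
    """Sample an interaction string a_1 o_1 ... a_t o_t from G."""
    rng = random.Random(seed)
    history = []
    for _ in range(steps):
        a = sample(agent(history), rng)   # G(a_t | ao_<t)     := P(a_t | ao_<t)
        o = sample(env(history, a), rng)  # G(o_t | ao_<t a_t) := Q(o_t | ao_<t a_t)
        history.append((a, o))
    return history


def prob_under_g(agent, env, history):
    """Probability that G assigns to a given interaction string."""
    p, past = 1.0, []
    for a, o in history:
        p *= agent(past)[a] * env(past, a)[o]
        past.append((a, o))
    return p
```

Note that only the agent's action conditional and the environment's observation conditional enter $G$; the agent's own predictions of observations play no role in generating the stream.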

Control Problem. An environment $Q$ is said to be known iff the agent $P$ is such that
for any $\underline{ao}_{\leq t} \in \mathcal{Z}^*$,

$$P(o_t | \underline{ao}_{<t} a_t) = Q(o_t | \underline{ao}_{<t} a_t).$$

Intuitively, this means that the agent "knows" the statistics of the environment's future
behavior under any past, and in particular, it "knows" the effects of given controls. If the
environment is known, then the designer of the agent can build a custom-made policy into
$P$ such that the resulting generative distribution $G$ produces interaction sequences that are
desirable. This can be done in multiple ways. For instance, the controls can be chosen
such that the resulting policy maximizes a given utility criterion; or such that the resulting
trajectory of the interaction system stays close enough to a prescribed trajectory. Formally,
if $Q$ is known, and if the conditional probabilities $P(a_t | \underline{ao}_{<t})$ for all $\underline{ao}_{\leq t} \in \mathcal{Z}^*$ have been
chosen such that the resulting generative distribution $G$ over interaction sequences given
by

$$G(a_t | \underline{ao}_{<t}) = P(a_t | \underline{ao}_{<t})$$
$$G(o_t | \underline{ao}_{<t} a_t) = Q(o_t | \underline{ao}_{<t} a_t) = P(o_t | \underline{ao}_{<t} a_t)$$

is desirable, then $P$ is said to be tailored to $Q$.

Adaptive control problem. If the environment $Q$ is unknown, then the task of designing
an appropriate agent $P$ constitutes an adaptive control problem. Specifically, this work
deals with the case when the designer already has a class of agents that are tailored to the
class of possible environments. Formally, it is assumed that $Q$ is going to be drawn with
probability $P(m)$ from a set $\mathcal{Q} := \{Q_m\}_{m \in \mathcal{M}}$ of possible systems before the interaction
starts, where $\mathcal{M}$ is a countable set. Furthermore, one has a set $\mathcal{P} := \{P_m\}_{m \in \mathcal{M}}$ of systems
such that for each $m \in \mathcal{M}$, $P_m$ is tailored to $Q_m$ and the interaction system $(P_m, Q_m)$
has a generative distribution $G_m$ that produces desirable interaction sequences. How can
the designer construct a system $P$ such that its behavior is as close as possible to the
custom-made system $P_m$ under any realization of $Q_m \in \mathcal{Q}$?
3. Adaptive Systems

The main goal of this paper is to show that the problem of adaptive control outlined in 
the previous section can be reformulated as a universal compression problem. This can be 
informally motivated as follows. Suppose the agent P is implemented as a machine that is 
interfaced with the environment Q. Whenever the agent interacts with the environment, 
the agent's state changes as a necessary consequence of the interaction. This "change in 
state" can take place in many possible ways: by updating the internal memory; consulting 
a random number generator; changing the physical location and orientation; and so forth. 
Naturally, the design of the agent facilitates some interactions while it complicates others. 
For instance, if the agent has been designed to explore a natural environment, then it might
incur a very low memory footprint when recording natural images, while being very
memory-inefficient when recording artificially created images. If one abstracts away from
the inner workings of the machine and decides to encode the state transitions as binary 
strings, then the minimal amount of resources in bits that are required to implement these 
state changes can be derived directly from the associated probability distribution P. In 
the context of adaptive control, an agent can be constructed such that it minimizes the 
expected amount of changes necessary to implement the state transitions, or equivalently, 
such that it maximally compresses the experience. Thereby, compression can be taken as a 
stand-alone principle to design adaptive agents. 



3.1 Universal Compression and Naive Construction of Adaptive Agents 

In coding theory, the problem of compressing a sequence of observations from an unknown 
source is known as the adaptive coding problem. This is solved by constructing universal 
compressors, i.e. codes that adapt on-the-fly to any source within a predefined class. Such 
codes are obtained by minimizing the average deviation of a predictor from the true source, 
and then by constructing codewords using the predictor. In this subsection, this procedure
will be used to derive an adaptive agent (Ortega and Braun, 2010a).



Formally, the deviation of a predictor $P$ from the true distribution $P_m$ is measured
by the relative entropy². A first approach would be to construct an agent $B$ so as to
minimize the total expected relative entropy to $P_m$. This is constructed as follows. Define
the history-dependent relative entropies over the action $a_t$ and observation $o_t$ as

$$D_m^{a_t}(\underline{ao}_{<t}) := \sum_{a_t} P_m(a_t | \underline{ao}_{<t}) \log \frac{P_m(a_t | \underline{ao}_{<t})}{\Pr(a_t | \underline{ao}_{<t})}$$

$$D_m^{o_t}(\underline{ao}_{<t} a_t) := \sum_{o_t} P_m(o_t | \underline{ao}_{<t} a_t) \log \frac{P_m(o_t | \underline{ao}_{<t} a_t)}{\Pr(o_t | \underline{ao}_{<t} a_t)},$$



2. The relative entropy is also known as the KL-divergence and it measures the average amount of extra 
bits that are necessary to encode symbols due to the usage of the (wrong) predictor. 
where $\Pr$ will be the argument of the variational problem. Then, one removes the
dependency on the past by averaging over all possible histories:

$$D_m^{a_t} := \sum_{\underline{ao}_{<t}} P_m(\underline{ao}_{<t}) \, D_m^{a_t}(\underline{ao}_{<t})$$

$$D_m^{o_t} := \sum_{\underline{ao}_{<t} a_t} P_m(\underline{ao}_{<t} a_t) \, D_m^{o_t}(\underline{ao}_{<t} a_t).$$

Finally, the total expected relative entropy of $\Pr$ from $P_m$ is obtained by summing up all
time steps and then by averaging over all choices of the true environment:

$$D := \limsup_{t \to \infty} \sum_m P(m) \sum_{\tau=1}^{t} \left( D_m^{a_\tau} + D_m^{o_\tau} \right). \qquad (2)$$



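Using (2), one can define a variational problem with respect to $\Pr$. The agent $B$ that one
is looking for is the system $\Pr$ that minimizes the total expected relative entropy in (2), i.e.

$$B := \arg\min_{\Pr} D(\Pr). \qquad (3)$$

The solution to Equation (3) is the system $B$ defined by the set of equations

$$B(a_t | \underline{ao}_{<t}) = \sum_m P_m(a_t | \underline{ao}_{<t}) \, w_m(\underline{ao}_{<t})$$

$$B(o_t | \underline{ao}_{<t} a_t) = \sum_m P_m(o_t | \underline{ao}_{<t} a_t) \, w_m(\underline{ao}_{<t} a_t) \qquad (4)$$

valid for all $\underline{ao}_{\leq t} \in \mathcal{Z}^*$, where the mixture weights are

$$w_m(\underline{ao}_{<t}) := \frac{P(m) P_m(\underline{ao}_{<t})}{\sum_{m'} P(m') P_{m'}(\underline{ao}_{<t})}$$

$$w_m(\underline{ao}_{<t} a_t) := \frac{P(m) P_m(\underline{ao}_{<t} a_t)}{\sum_{m'} P(m') P_{m'}(\underline{ao}_{<t} a_t)}. \qquad (5)$$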

For reference, see Haussler and Opper (1997) and Opper (1998). It is clear that $B$ is just
the Bayesian mixture over the agents $P_m$. If one defines the conditional probabilities

$$P(a_t | m, \underline{ao}_{<t}) := P_m(a_t | \underline{ao}_{<t})$$
$$P(o_t | m, \underline{ao}_{<t} a_t) := P_m(o_t | \underline{ao}_{<t} a_t) \qquad (6)$$

for all $\underline{ao}_{\leq t} \in \mathcal{Z}^*$, then Equation (4) can be rewritten as

$$B(a_t | \underline{ao}_{<t}) = \sum_m P(a_t | m, \underline{ao}_{<t}) P(m | \underline{ao}_{<t}) = P(a_t | \underline{ao}_{<t})$$

$$B(o_t | \underline{ao}_{<t} a_t) = \sum_m P(o_t | m, \underline{ao}_{<t} a_t) P(m | \underline{ao}_{<t} a_t) = P(o_t | \underline{ao}_{<t} a_t) \qquad (7)$$

where the $P(m | \underline{ao}_{<t}) = w_m(\underline{ao}_{<t})$ and $P(m | \underline{ao}_{<t} a_t) = w_m(\underline{ao}_{<t} a_t)$ are just the posterior
probabilities over the elements in $\mathcal{M}$ given the past interactions. Hence, the conditional
probabilities in (4) that minimize the total expected divergence are just the predictive
distributions $P(a_t | \underline{ao}_{<t})$ and $P(o_t | \underline{ao}_{<t} a_t)$ that one obtains by standard probability theory,
and in particular, Bayes' rule. This is interesting, as it provides a teleological justification
for Bayes' rule.

The behavior of $B$ can be described as follows. At any given time $t$, $B$ maintains a
mixture over systems $P_m$. The weighting over them is given by the mixture coefficients
$w_m$. Whenever a new action $a_t$ or a new observation $o_t$ is produced (by the agent or
the environment respectively), the weights $w_m$ are updated according to Bayes' rule. In
addition, $B$ issues an action $a_t$ suggested by a system $P_m$ drawn randomly according to the
weights $w_m$.
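
The update performed by $B$ can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the paper's construction in full generality: the two experts `m0` and `m1` are hypothetical, memoryless (their conditionals ignore the history), and deliberately share the same observation model, so that any change in the weights comes from the action alone.

```python
def naive_update(weights, policies, hypotheses, action, obs):
    """One step of the naive mixture B: condition on the action AND the observation."""
    unnorm = {
        m: w * policies[m][action] * hypotheses[m][action][obs]
        for m, w in weights.items()
    }
    total = sum(unnorm.values())
    return {m: v / total for m, v in unnorm.items()}


def mixture_action_probs(weights, policies):
    """B(a_t | history) = sum_m P_m(a_t | history) * w_m(history)."""
    actions = next(iter(policies.values())).keys()
    return {a: sum(weights[m] * policies[m][a] for m in weights) for a in actions}


# Hypothetical memoryless experts: policies[m][a] and hypotheses[m][a][o].
policies = {"m0": {0: 0.9, 1: 0.1}, "m1": {0: 0.1, 1: 0.9}}
# Both experts predict the same observations, so observations carry no evidence.
coin = {0: 0.5, 1: 0.5}
hypotheses = {"m0": {0: coin, 1: coin}, "m1": {0: coin, 1: coin}}

prior = {"m0": 0.5, "m1": 0.5}
posterior = naive_update(prior, policies, hypotheses, action=0, obs=1)
```

In this toy setting the weight of `m0` rises from 0.5 to 0.9 after a single step, although nothing has been learned about the environment: the evidence comes entirely from the action that the mixture itself produced.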

However, there is an important problem with B that arises due to the fact that it is not 
only a system that is passively observing symbols, but also actively generating them. In 
the subjective interpretation of probability theory, conditionals play the role of observations 
made by the agent that have been generated by an external source. This interpretation suits 
the symbols $o_1, o_2, o_3, \ldots$ because they have been issued by the environment. However,
symbols that are generated by the system itself require a fundamentally different belief update.
Intuitively, the difference can be explained as follows. Observations provide information
that allows the agent to infer properties of the environment. In contrast, actions do
not carry information about the environment, and thus have to be incorporated differently 
into the belief of the agent. In the following section we illustrate this problem with a simple 
statistical example. 

3.2 Causality 

Causality is the study of the functional dependencies of events. This stands in contrast to 
statistics, which, on an abstract level, can be said to study the equivalence dependencies 
(i.e. co-occurrence or correlation) amongst events. Causal statements differ fundamentally 
from statistical statements. Examples that highlight the differences are many, such as
"do smokers get lung cancer?" as opposed to "do smokers have lung cancer?"; "assign
$y \leftarrow f(x)$" as opposed to "compare $y = f(x)$" in programming languages; and "$a \leftarrow F/m$"
as opposed to "$F = m\,a$" in Newtonian physics. The study of causality has recently enjoyed
considerable attention from researchers in the fields of statistics and machine learning.
Especially over the last decade, significant progress has been made towards the formal
understanding of causation (Shafer, 1996; Pearl, 2000; Spirtes et al., 2000; Dawid, 2010).



In this subsection, the aim is to provide the essential tools required to understand causal 
interventions. For a more in-depth exposition of causality, the reader is referred to the 
specialized literature. 

To illustrate the need for causal considerations in the case of generated symbols, consider
the following thought experiment. Suppose a statistician is asked to design a model for a
simple time series $X_1, X_2, X_3, \ldots$ and she decides to use a Bayesian method. Assume she
collects a first observation $X_1 = x_1$. She computes the posterior probability density function
(pdf) over the parameters $\theta$ of the model given the data using Bayes' rule:

$$p(\theta | X_1 = x_1) = \frac{p(X_1 = x_1 | \theta) \, p(\theta)}{\int p(X_1 = x_1 | \theta') \, p(\theta') \, d\theta'},$$

where $p(X_1 = x_1 | \theta)$ is the likelihood of $x_1$ given $\theta$ and $p(\theta)$ is the prior pdf of $\theta$. She can
use the model to predict the next observation by drawing a sample $x_2$ from the predictive
pdf

$$p(X_2 = x_2 | X_1 = x_1) = \int p(X_2 = x_2 | X_1 = x_1, \theta) \, p(\theta | X_1 = x_1) \, d\theta,$$

where $p(X_2 = x_2 | X_1 = x_1, \theta)$ is the likelihood of $x_2$ given $x_1$ and $\theta$. She understands
that the nature of $x_2$ is very different from $x_1$: while $x_1$ is informative and does change
the belief state of the Bayesian model, $x_2$ is non-informative and thus is a reflection of the
model's belief state. Hence, she would never use $x_2$ to further condition the Bayesian model.
Mathematically, she seems to imply that

$$p(\theta | X_1 = x_1, X_2 = x_2) = p(\theta | X_1 = x_1)$$

if $x_2$ has been generated from $p(X_2 | X_1 = x_1)$ itself. But this simple independence
assumption is not correct as the following elaboration of the example will show.

The statistician is now told that the source is waiting for the simulated data point $x_2$
in order to produce a next observation $X_3 = x_3$ which does depend on $x_2$. She hands in $x_2$
and obtains a new observation $x_3$. Using Bayes' rule, the posterior pdf over the parameters
is now

$$\frac{p(X_3 = x_3 | X_1 = x_1, X_2 = x_2, \theta) \, p(X_1 = x_1 | \theta) \, p(\theta)}{\int p(X_3 = x_3 | X_1 = x_1, X_2 = x_2, \theta') \, p(X_1 = x_1 | \theta') \, p(\theta') \, d\theta'}, \qquad (8)$$

where $p(X_3 = x_3 | X_1 = x_1, X_2 = x_2, \theta)$ is the likelihood of the new data $x_3$ given the old
data $x_1$, the parameters $\theta$ and the simulated data $x_2$. Notice that this looks almost like the
posterior pdf $p(\theta | X_1 = x_1, X_2 = x_2, X_3 = x_3)$ given by

$$\frac{p(X_3 = x_3 | X_1 = x_1, X_2 = x_2, \theta) \, p(X_2 = x_2 | X_1 = x_1, \theta) \, p(X_1 = x_1 | \theta) \, p(\theta)}{\int p(X_3 = x_3 | X_1 = x_1, X_2 = x_2, \theta') \, p(X_2 = x_2 | X_1 = x_1, \theta') \, p(X_1 = x_1 | \theta') \, p(\theta') \, d\theta'}$$

with the exception that in the latter case, the Bayesian update contains the likelihoods of
the simulated data $p(X_2 = x_2 | X_1 = x_1, \theta)$. This suggests that Equation (8) is a variant of the
posterior pdf $p(\theta | X_1 = x_1, X_2 = x_2, X_3 = x_3)$ but where the simulated data $x_2$ is treated
in a different way than the data $x_1$ and $x_3$.

Define the pdf $p'$ such that the pdfs $p'(\theta)$, $p'(X_1 | \theta)$, $p'(X_3 | X_1, X_2, \theta)$ are identical to
$p(\theta)$, $p(X_1 | \theta)$ and $p(X_3 | X_1, X_2, \theta)$ respectively, but differ in $p'(X_2 | X_1, \theta)$:

$$p'(X_2 | X_1, \theta) = \delta(X_2 - x_2),$$

where $\delta$ is the Dirac delta function. That is, $p'$ is identical to $p$ but it assumes that the
value of $X_2$ is fixed to $x_2$ given $X_1$ and $\theta$. For $p'$, the simulated data $x_2$ is non-informative:

$$-\log_2 p'(X_2 = x_2 | X_1, \theta) = 0.$$

If one computes the posterior pdf $p'(\theta | X_1 = x_1, X_2 = x_2, X_3 = x_3)$, one obtains the result
of Equation (8):

$$\frac{p'(X_3 = x_3 | X_1 = x_1, X_2 = x_2, \theta) \, p'(X_2 = x_2 | X_1 = x_1, \theta) \, p'(X_1 = x_1 | \theta) \, p'(\theta)}{\int p'(X_3 = x_3 | X_1 = x_1, X_2 = x_2, \theta') \, p'(X_2 = x_2 | X_1 = x_1, \theta') \, p'(X_1 = x_1 | \theta') \, p'(\theta') \, d\theta'}$$

$$= \frac{p(X_3 = x_3 | X_1 = x_1, X_2 = x_2, \theta) \, p(X_1 = x_1 | \theta) \, p(\theta)}{\int p(X_3 = x_3 | X_1 = x_1, X_2 = x_2, \theta') \, p(X_1 = x_1 | \theta') \, p(\theta') \, d\theta'}.$$

Thus, in order to explain Equation (8) as a posterior pdf given the observed data $x_1$ and $x_3$
and the generated data $x_2$, one has to intervene $p$ in order to account for the fact that $x_2$
is non-informative given $x_1$ and $\theta$. In other words, the statistician, by defining the value
of $X_2$ herself, has changed the (natural) regime that brings about the series $X_1, X_2, X_3, \ldots$,
which is mathematically expressed by redefining the pdf.
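
The difference between the two posteriors can be checked numerically on a discrete analogue of the thought experiment. The sketch below is only illustrative: the two-valued parameter, the Bernoulli likelihoods in `P_ONE` and the way $X_3$ depends on $x_2$ are all hypothetical choices, and the Dirac delta is replaced by its discrete counterpart, a factor of 1 at the chosen value.

```python
def bernoulli(p_one, x):
    """Probability of the binary outcome x when P(x = 1) = p_one."""
    return p_one if x == 1 else 1.0 - p_one


# Hypothetical two-valued parameter theta: probability that a symbol equals 1.
P_ONE = {"A": 0.8, "B": 0.3}
PRIOR = {"A": 0.5, "B": 0.5}


def posterior(x1, x2, x3, intervene_on_x2):
    """Posterior over theta, with x2 either conditioned on or intervened."""
    unnorm = {}
    for theta, prior in PRIOR.items():
        lik_x1 = bernoulli(P_ONE[theta], x1)
        # Intervening replaces p(X2 | X1, theta) by a delta at x2: a factor of 1.
        lik_x2 = 1.0 if intervene_on_x2 else bernoulli(P_ONE[theta], x2)
        # X3 depends on x2: it follows the bias of theta if x2 == 1, else it is fair.
        lik_x3 = bernoulli(P_ONE[theta] if x2 == 1 else 0.5, x3)
        unnorm[theta] = prior * lik_x1 * lik_x2 * lik_x3
    z = sum(unnorm.values())
    return {theta: v / z for theta, v in unnorm.items()}


conditioned = posterior(1, 1, 1, intervene_on_x2=False)
intervened = posterior(1, 1, 1, intervene_on_x2=True)
```

With these numbers, treating the simulated $x_2$ as ordinary evidence overstates the belief in `"A"` relative to the intervened posterior, which uses only the likelihoods of $x_1$ and $x_3$ as in Equation (8).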

Two essential ingredients are needed to carry out interventions. First, one needs to
know the functional dependencies amongst the random variables of the probabilistic model.
This is provided by the causal model, i.e. the unique factorization of the joint probability
distribution over the random variables encoding the causal dependencies. In the general
case, this defines a partial order over the random variables. In the previous thought
experiment, the causal model of the joint pdf $p(\theta, X_1, X_2, X_3)$ is given by the set of conditional
pdfs

$$p(\theta), \quad p(X_1 | \theta), \quad p(X_2 | X_1, \theta), \quad p(X_3 | X_1, X_2, \theta).$$

Second, one defines the intervention that sets $X$ to the value $x$, denoted as $X \leftarrow x$, as
the operation on the causal model replacing the conditional probability of $X$ by a Dirac
delta function $\delta(X - x)$ or a Kronecker delta $\delta^X_x$ for a continuous or a discrete variable $X$
respectively. In our thought experiment, it is easily seen that

$$p'(\theta, X_1 = x_1, X_2 = x_2, X_3 = x_3) = p(\theta, X_1 = x_1, X_2 \leftarrow x_2, X_3 = x_3)$$

and thereby,

$$p'(\theta | X_1 = x_1, X_2 = x_2, X_3 = x_3) = p(\theta | X_1 = x_1, X_2 \leftarrow x_2, X_3 = x_3).$$

Causal models contain additional information that is not available in the joint probability
distribution alone. The appropriate model for a given situation depends on the story that
is being told. Note that the same intervention can lead to different results if the underlying
causal models differ. Thus, if the causal model had been

$$p(X_3), \quad p(X_2 | X_3), \quad p(X_1 | X_2, X_3), \quad p(\theta | X_1, X_2, X_3)$$

then the intervention $X_2 \leftarrow x_2$ would differ from $p'$, i.e.

$$p'(\theta, X_1 = x_1, X_2 = x_2, X_3 = x_3) \neq p(\theta, X_1 = x_1, X_2 \leftarrow x_2, X_3 = x_3),$$

even though both causal models represent the same joint probability distribution. In the
following, this paper will use the shorthand notation $\hat{x} := X \leftarrow x$ when the random variable
is obvious from the context.
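
That one joint distribution can yield different intervened distributions is easy to verify on a reduced example. The sketch below keeps only two variables, a parameter $\theta$ and a single binary variable $X$; the joint table `JOINT` is a hypothetical choice. It applies the intervention $X \leftarrow x$ under the two possible causal orderings and then reads off the belief about $\theta$.

```python
# Hypothetical joint p(theta, X) over theta in {"A", "B"} and X in {0, 1}.
JOINT = {("A", 0): 0.10, ("A", 1): 0.40, ("B", 0): 0.35, ("B", 1): 0.15}
THETAS = ("A", "B")


def p_theta(theta):
    """Marginal p(theta)."""
    return sum(p for (th, _), p in JOINT.items() if th == theta)


def p_x(x):
    """Marginal p(X = x)."""
    return sum(p for (_, xv), p in JOINT.items() if xv == x)


def intervene_theta_causes_x(x):
    """Causal model p(theta), p(X | theta): X <- x replaces p(X | theta) by a delta."""
    unnorm = {th: p_theta(th) * 1.0 for th in THETAS}
    z = sum(unnorm.values())
    return {th: v / z for th, v in unnorm.items()}


def intervene_x_causes_theta(x):
    """Causal model p(X), p(theta | X): X <- x replaces p(X) by a delta."""
    unnorm = {th: 1.0 * JOINT[(th, x)] / p_x(x) for th in THETAS}
    z = sum(unnorm.values())
    return {th: v / z for th, v in unnorm.items()}
```

Under the first ordering the intervention leaves the belief about $\theta$ at its prior, whereas under the second it coincides with ordinary conditioning on $X = x$, even though both factorizations encode the same joint table.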

3.3 Causal construction of adaptive agents 

Following the discussion in the previous section, an adaptive agent $P$ is going to be
constructed by minimizing the expected relative entropy to the $P_m$, but this time treating
actions as interventions. Based on the definition of the conditional probabilities in
Equation (6), the total expected relative entropy to characterize $P$ using interventions is going
to be defined. Assuming the environment is chosen first, and that each symbol depends
functionally on the environment and all the previously generated symbols, the causal model
is given by

$$P(m), \; P(a_1 | m), \; P(o_1 | m, a_1), \; P(a_2 | m, a_1, o_1), \; P(o_2 | m, a_1, o_1, a_2), \; \ldots$$

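Importantly, interventions index a set of intervened probability distributions derived from
a base probability distribution. Hence, the set of fixed intervention sequences of the form
$\hat{a}_1, \hat{a}_2, \ldots$ indexes probability distributions over observation sequences $o_1, o_2, \ldots$. Because
of this, one defines a set of criteria indexed by the intervention sequences, but it will be
clear that they all have the same solution. Define the history-dependent intervened relative
entropies over the action $a_t$ and observation $o_t$ as

$$C_m^{a_t}(\underline{\hat{a}o}_{<t}) := \sum_{a_t} P(a_t | m, \underline{\hat{a}o}_{<t}) \log_2 \frac{P(a_t | m, \underline{\hat{a}o}_{<t})}{\Pr(a_t | \underline{ao}_{<t})}$$

$$C_m^{o_t}(\underline{\hat{a}o}_{<t} \hat{a}_t) := \sum_{o_t} P(o_t | m, \underline{\hat{a}o}_{<t} \hat{a}_t) \log_2 \frac{P(o_t | m, \underline{\hat{a}o}_{<t} \hat{a}_t)}{\Pr(o_t | \underline{ao}_{<t} a_t)},$$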

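where $\Pr$ is a given arbitrary agent. Note that past actions are treated as interventions
(marked by hats). In particular, $P(a_t | m, \underline{\hat{a}o}_{<t})$ represents the knowledge state when the past
actions have already been issued but the next action $a_t$ is not known yet. Then, averaging the
previous relative entropies over all pasts yields

$$C_m^{a_t} := \sum_{\underline{ao}_{<t}} P(\underline{\hat{a}o}_{<t} | m) \, C_m^{a_t}(\underline{\hat{a}o}_{<t})$$

$$C_m^{o_t} := \sum_{\underline{ao}_{<t} a_t} P(\underline{\hat{a}o}_{<t} \hat{a}_t | m) \, C_m^{o_t}(\underline{\hat{a}o}_{<t} \hat{a}_t).$$

Here again, because of the knowledge state in time represented by $C_m^{a_t}(\underline{\hat{a}o}_{<t})$ and
$C_m^{o_t}(\underline{\hat{a}o}_{<t} \hat{a}_t)$, the averages are taken treating past actions as interventions. Finally, define
the total expected relative entropy of $\Pr$ from $P_m$ as the sum of $(C_m^{a_t} + C_m^{o_t})$ over time,
averaged over the possible draws of the environment:

$$C := \limsup_{t \to \infty} \sum_m P(m) \sum_{\tau=1}^{t} \left( C_m^{a_\tau} + C_m^{o_\tau} \right). \qquad (9)$$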

The variational problem consists in choosing the agent $P$ as the system $\Pr$ minimizing
$C = C(\Pr)$, i.e.

$$P := \arg\min_{\Pr} C(\Pr). \qquad (10)$$

The following theorem shows that this variational problem has a unique solution, which will 
be the central theme of this paper. 

Theorem 1 The solution to Equation (10) is the system $P$ defined by the set of equations

$$P(a_t | \underline{ao}_{<t}) = P(a_t | \underline{\hat{a}o}_{<t}) = \sum_m P(a_t | m, \underline{\hat{a}o}_{<t}) \, v_m(\underline{ao}_{<t})$$

$$P(o_t | \underline{ao}_{<t} a_t) = P(o_t | \underline{\hat{a}o}_{<t} \hat{a}_t) = \sum_m P(o_t | m, \underline{\hat{a}o}_{<t} \hat{a}_t) \, v_m(\underline{ao}_{<t} a_t) \qquad (11)$$

valid for all $\underline{ao}_{\leq t} \in \mathcal{Z}^*$, where the mixture weights are

$$v_m(\underline{ao}_{<t} a_t) = v_m(\underline{ao}_{<t}) := \frac{P(m) \prod_{\tau=1}^{t-1} P(o_\tau | m, \underline{\hat{a}o}_{<\tau} \hat{a}_\tau)}{\sum_{m'} P(m') \prod_{\tau=1}^{t-1} P(o_\tau | m', \underline{\hat{a}o}_{<\tau} \hat{a}_\tau)}. \qquad (12)$$



The behavior of $P$ differs in an important aspect from $B$. At any given time $t$, $P$
maintains a mixture over systems $P_m$. The weighting over these systems is given by the
mixture coefficients $v_m$. In contrast to $B$, $P$ updates the weights $v_m$ only whenever a new
observation $o_t$ is produced by the environment. The update follows Bayes' rule but treats
past actions as interventions by dropping the evidence they provide. In addition, $P$ issues
an action $a_t$ suggested by a system $m$ drawn randomly according to the weights $v_m$.

Perhaps surprisingly, the theorem says that the optimal solution to the variational
problem in (10) is precisely the predictive distribution over actions and observations treating
actions as interventions and observations as conditionals, i.e. it is the solution that one
would obtain by applying only standard probability and causal calculus. This provides a
teleological interpretation to the agent $P$ akin to the naive agent $B$ constructed in
Section 3.1.



3.4 Summary 



Adaptive control is formalized as the problem of designing an agent for an unknown envi- 
ronment chosen from a class of possible environments. If the environment-specific agents are 
known, then the Bayesian control rule allows constructing an adaptive agent by combining 
these agents. The resulting adaptive agent is universal with respect to the environment 
class. In this context, the constituent agents are called the operation modes of the adaptive 
agent. They are represented by causal models over the interaction sequences, i.e. conditional
probabilities $P(a_t | m, \underline{\hat{a}o}_{<t})$ and $P(o_t | m, \underline{\hat{a}o}_{<t} \hat{a}_t)$ for all $\underline{ao}_{\leq t} \in \mathcal{Z}^*$, and where $m \in \mathcal{M}$
is the index or parameter characterizing the operation mode. The probability distribution
over the input stream (output stream) is called the hypothesis (policy) of the operation 
mode. The following box collects the essential equations of the Bayesian control rule. In
particular, here the rule is stated using a recursive belief update.

Bayesian control rule: Given a set of operation modes
$\{P(\cdot | m, \cdot)\}_{m \in \mathcal{M}}$ over interaction sequences in $\mathcal{Z}^\infty$ and a prior
distribution $P(m)$ over the parameters $\mathcal{M}$, the probability of the action
$a_{t+1}$ is given by

$$P(a_{t+1} | \underline{\hat{a}o}_{\leq t}) = \sum_m P(a_{t+1} | m, \underline{\hat{a}o}_{\leq t}) \, P(m | \underline{\hat{a}o}_{\leq t}), \qquad (13)$$

where the posterior probability over operation modes is given by the recursion

$$P(m | \underline{\hat{a}o}_{\leq t}) = \frac{P(o_t | m, \underline{\hat{a}o}_{<t} \hat{a}_t) \, P(m | \underline{\hat{a}o}_{<t})}{\sum_{m'} P(o_t | m', \underline{\hat{a}o}_{<t} \hat{a}_t) \, P(m' | \underline{\hat{a}o}_{<t})}.$$
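
The rule can be turned into a short sampling loop: draw an operation mode from the current posterior, act according to that mode's policy, and update the posterior with the observation likelihood only. The following is a minimal sketch on a two-armed bandit, assuming two hypothetical operation modes (mode `m` believes that arm `m` pays with probability 0.8 and the other arm with probability 0.2, and always pulls arm `m`); these numbers and names are illustrative and are not taken from the paper's experiments.

```python
import random

MODES = (0, 1)


def obs_likelihood(mode, action, reward):
    """Hypothesis P(o_t | m, a_t): mode m believes arm m pays with probability 0.8."""
    p_win = 0.8 if action == mode else 0.2
    return p_win if reward == 1 else 1.0 - p_win


def policy(mode):
    """Policy P(a_t | m): mode m deterministically pulls arm m."""
    return mode


def update(posterior, action, reward):
    """Recursive belief update: only the observation likelihood enters."""
    unnorm = {m: posterior[m] * obs_likelihood(m, action, reward) for m in MODES}
    z = sum(unnorm.values())
    return {m: v / z for m, v in unnorm.items()}


def run(true_mode, steps, seed=0):
    """Interact with a bandit whose dynamics match the hypothesis of true_mode."""
    rng = random.Random(seed)
    posterior = {m: 0.5 for m in MODES}
    for _ in range(steps):
        # Sampling a mode and then following its policy realizes Equation (13).
        m = rng.choices(MODES, weights=[posterior[k] for k in MODES])[0]
        action = policy(m)
        p_win = 0.8 if action == true_mode else 0.2
        reward = 1 if rng.random() < p_win else 0
        posterior = update(posterior, action, reward)
    return posterior
```

The action never appears as a likelihood term in `update`: it only selects which observation model is evaluated, which is how the rule drops the evidence that the agent's own actions would otherwise provide.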



4. Convergence 

The aim of this section is to develop a set of sufficient conditions of convergence and then 
to provide a proof of convergence. To simplify the exposition, the analysis has been limited 
to the case of controllers having a finite number of input-output models. 

4.1 Policy diagrams 

In the following we use "policy diagrams" as a useful informal tool to analyze the effect of
policies on environments. Figure 2 illustrates an example.



Figure 2: A policy diagram. One can imagine an environment as a collection of states
connected by transitions labeled by I/O symbols. The zoom highlights a state
$s$ where taking action $a \in \mathcal{A}$ and collecting observation $o \in \mathcal{O}$ leads to state
s'. Sets of states and transitions are represented as enclosed areas similar to 
a Venn diagram. Choosing a particular policy in an environment amounts to 
partially controlling the transitions taken in the state space, thereby choosing a 
probability distribution over state transitions (e.g. a Markov chain given by the 
environmental dynamics). If the probability mass concentrates in certain areas 
of the state space, choosing a policy can be thought of as choosing a subset of the 
environment's dynamics. In the following, a policy is represented by a subset in 
state space (enclosed by a directed curve) as illustrated above. 



Policy diagrams are especially useful to analyze the effect of policies on different hypotheses
about the environment's dynamics. An agent that is endowed with a set of operation
modes $\mathcal{M}$ can be seen as having hypotheses about the environment's underlying dynamics,
given by the observation models $P(o_t \mid m, ao_{<t}a_t)$, and associated policies, given by the
action models $P(a_t \mid m, ao_{<t})$, for all $m \in \mathcal{M}$. For the sake of simplifying the interpretation
of policy diagrams, we will assume the existence of a state space $\mathcal{S}$ and a function
$T : (\mathcal{A} \times \mathcal{O})^* \to \mathcal{S}$ mapping I/O histories into states. With this assumption, policies and
hypotheses can be seen as conditional probabilities

$$P(a_t \mid m, s) := P(a_t \mid m, ao_{<t}) \quad \text{and} \quad P(o_t \mid m, s, a_t) := P(o_t \mid m, ao_{<t}a_t)$$

respectively, defining transition probabilities

$$P(s' \mid m, s) = \sum_{S'} P(ao_t \mid m, s)$$

for a Markov chain in the state space, where $s = T(ao_{<t})$ and $S'$ contains the transitions
$ao_t$ such that $T(ao_{\le t}) = s'$.
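The construction of the induced Markov chain can be illustrated with a minimal Python sketch. It is not part of the original analysis: the two-state space, the action and observation symbols, all probabilities, and the lookup table `next_state` (which stands in for the map $T$) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def transition_probs(policy, obs_model, next_state):
    """Return P(s'|m,s): the sum of P(a|m,s) * P(o|m,s,a) over all
    pairs (a, o) whose transition leads from s to s'."""
    chain = {}
    for s, action_probs in policy.items():
        row = defaultdict(float)
        for a, p_a in action_probs.items():
            for o, p_o in obs_model[(s, a)].items():
                row[next_state[(s, a, o)]] += p_a * p_o
        chain[s] = dict(row)
    return chain


# Hypothetical operation mode m on a two-state space (made-up numbers).
policy = {"s0": {"L": 0.5, "R": 0.5}, "s1": {"L": 1.0}}   # P(a|m,s)
obs_model = {                                              # P(o|m,s,a)
    ("s0", "L"): {0: 1.0},
    ("s0", "R"): {0: 0.2, 1: 0.8},
    ("s1", "L"): {0: 0.5, 1: 0.5},
}
next_state = {                                             # stands in for T
    ("s0", "L", 0): "s0",
    ("s0", "R", 0): "s0",
    ("s0", "R", 1): "s1",
    ("s1", "L", 0): "s0",
    ("s1", "L", 1): "s1",
}

chain = transition_probs(policy, obs_model, next_state)
for s, row in chain.items():
    print(s, row)
```

Each row of `chain` is a probability distribution over successor states, so a policy together with a hypothesis indeed selects one Markov chain on the state space.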



4.2 Divergence processes 

The central question in this section is to investigate whether the Bayesian control rule
converges to the correct control law or not. That is, whether $P(a_t \mid ao_{<t}) \to P(a_t \mid m^*, ao_{<t})$
as $t \to \infty$ when $m^*$ is the true operation mode, i.e. the operation mode such that
$P(a_t \mid m^*, ao_{<t}) = Q(a_t \mid ao_{<t})$. As will be obvious from the discussion in the rest of this
section, this is in general not true.

As is easily seen from Equation (13), showing convergence amounts to showing that the
posterior distribution $P(m \mid ao_{<t})$ concentrates its probability mass on a subset of operation
modes $\mathcal{M}^*$ having essentially the same output stream as $m^*$,

$$\sum_{m \in \mathcal{M}} P(a_t \mid m, ao_{<t})\, P(m \mid ao_{<t}) \approx \sum_{m \in \mathcal{M}^*} P(a_t \mid m^*, ao_{<t})\, P(m \mid ao_{<t}) \approx P(a_t \mid m^*, ao_{<t}).$$



Hence, understanding the asymptotic behavior of the posterior probabilities

$$P(m \mid ao_{\le t})$$

is crucial here. In particular, we need to understand under what conditions these quantities
converge to zero. The posterior can be rewritten as

$$P(m \mid ao_{\le t}) = \frac{P(ao_{\le t} \mid m)\, P(m)}{\sum_{m' \in \mathcal{M}} P(ao_{\le t} \mid m')\, P(m')} = \frac{P(m) \prod_{\tau=1}^{t} P(o_\tau \mid m, ao_{<\tau}a_\tau)}{\sum_{m' \in \mathcal{M}} P(m') \prod_{\tau=1}^{t} P(o_\tau \mid m', ao_{<\tau}a_\tau)}.$$

Footnote 3: Note however that no such assumptions are made to obtain the results of this section.

If all the summands but the one with index $m^*$ are dropped from the denominator, one
obtains the bound

$$P(m \mid ao_{\le t}) \le \frac{P(m)}{P(m^*)} \prod_{\tau=1}^{t} \frac{P(o_\tau \mid m, ao_{<\tau}a_\tau)}{P(o_\tau \mid m^*, ao_{<\tau}a_\tau)},$$

which is valid for all $m^* \in \mathcal{M}$. From this inequality, it is seen that it is convenient to
analyze the behavior of the stochastic process

$$d_t(m^* \| m) := \sum_{\tau=1}^{t} \ln \frac{P(o_\tau \mid m^*, ao_{<\tau}a_\tau)}{P(o_\tau \mid m, ao_{<\tau}a_\tau)}$$



which is the divergence process of $m$ from the reference $m^*$. Indeed, if $d_t(m^* \| m) \to \infty$ as
$t \to \infty$, then

$$\lim_{t \to \infty} \frac{P(m)}{P(m^*)} \prod_{\tau=1}^{t} \frac{P(o_\tau \mid m, ao_{<\tau}a_\tau)}{P(o_\tau \mid m^*, ao_{<\tau}a_\tau)} = \lim_{t \to \infty} \frac{P(m)}{P(m^*)} \cdot e^{-d_t(m^* \| m)} = 0,$$

and thus clearly $P(m \mid ao_{\le t}) \to 0$. Figure 3 illustrates simultaneous realizations of the
divergence processes of a controller. Intuitively speaking, these processes provide lower
bounds on accumulators of surprise value measured in information units.
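The relation between a divergence process and the posterior bound can be checked numerically. The following Python sketch is an illustration under simplifying assumptions that are not taken from the paper: observations are Bernoulli, actions play no role, and the three operation modes (`m_star`, `m_1`, `m_2`) and their biases are hypothetical.

```python
import math
import random


def divergence_and_posterior(biases, prior, true_mode, steps, seed=0):
    """Simulate observations from the true mode and return
    (d, posterior, bound), where d[m] is d_t(m*||m) and
    bound[m] = P(m)/P(m*) * exp(-d_t(m*||m))."""
    rng = random.Random(seed)
    d = {m: 0.0 for m in biases}        # divergence processes
    loglik = {m: 0.0 for m in biases}   # log-likelihood of each mode
    for _ in range(steps):
        o = 1 if rng.random() < biases[true_mode] else 0
        p_true = biases[true_mode] if o == 1 else 1.0 - biases[true_mode]
        for m, b in biases.items():
            p_m = b if o == 1 else 1.0 - b
            d[m] += math.log(p_true / p_m)
            loglik[m] += math.log(p_m)
    # Posterior computed in log space for numerical stability.
    logw = {m: math.log(prior[m]) + loglik[m] for m in biases}
    top = max(logw.values())
    z = sum(math.exp(v - top) for v in logw.values())
    post = {m: math.exp(logw[m] - top) / z for m in biases}
    bound = {m: prior[m] / prior[true_mode] * math.exp(-d[m]) for m in biases}
    return d, post, bound


# Hypothetical modes: m_1 predicts differently, m_2 predicts like m_star.
biases = {"m_star": 0.7, "m_1": 0.4, "m_2": 0.7}
prior = {m: 1.0 / 3.0 for m in biases}
d, post, bound = divergence_and_posterior(biases, prior, "m_star", steps=500)
for m in biases:
    print(m, d[m], post[m], bound[m])
```

In this toy setting the divergence process of `m_1` grows and its posterior vanishes, while the process of `m_2` stays at zero, so `m_2` keeps sharing the posterior mass with `m_star`.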




Figure 3: Realization of the divergence processes 1 to 4 associated to a controller with
operation modes $m_1$ to $m_4$. The divergence processes 1 and 2 diverge, whereas 3
and 4 stay below the dotted bound. Hence, the posterior probabilities of $m_1$ and
$m_2$ vanish.



A divergence process is a random walk whose value at time $t$ depends on the whole
history up to time $t - 1$. What makes these divergence processes cumbersome to characterize
is the fact that their statistical properties depend on the particular policy that is applied;
hence, a given divergence process can have different growth rates depending on the policy
(Figure 4). Indeed, the behavior of a divergence process might depend critically on the
distribution over actions that is used. For example, it can happen that a divergence process
stays stable under one policy, but diverges under another. In the context of the Bayesian
control rule this problem is further aggravated, because in each time step, the policy that
is applied is determined stochastically. More specifically, if $m^*$ is the true operation mode,
then $d_t(m^* \| m)$ is a random variable that depends on the realization $ao_{\le t}$ which is drawn
from

$$\prod_{\tau=1}^{t} P(a_\tau \mid m_\tau, ao_{<\tau})\, P(o_\tau \mid m^*, ao_{<\tau}a_\tau),$$

where the $m_1, m_2, \ldots, m_t$ are drawn themselves from $P(m_1), P(m_2 \mid ao_1), \ldots, P(m_t \mid ao_{<t})$.

Figure 4: The application of different policies leads to different statistical properties of the
same divergence process.



To deal with the heterogeneous nature of divergence processes, one can introduce a
temporal decomposition that demultiplexes the original process into many sub-processes
belonging to unique policies. Let $\mathcal{N}_t := \{1, 2, \ldots, t\}$ be the set of time steps up to time $t$.
Let $\mathcal{T} \subset \mathcal{N}_t$, and let $m, m' \in \mathcal{M}$. Define a sub-divergence of $d_t(m^* \| m)$ as a random variable

$$g(m'; \mathcal{T}) := \sum_{\tau \in \mathcal{T}} \ln \frac{P(o_\tau \mid m^*, ao_{<\tau}a_\tau)}{P(o_\tau \mid m, ao_{<\tau}a_\tau)}$$

drawn from

$$P_{m'}\bigl(\{ao_\tau\}_{\tau \in \mathcal{T}} \mid \{ao_\tau\}_{\tau \in \mathcal{T}^c}\bigr) := \Bigl(\prod_{\tau \in \mathcal{T}} P(a_\tau \mid m', ao_{<\tau})\Bigr) \Bigl(\prod_{\tau \in \mathcal{T}} P(o_\tau \mid m^*, ao_{<\tau}a_\tau)\Bigr),$$

where $\mathcal{T}^c := \mathcal{N}_t \setminus \mathcal{T}$ and where $\{ao_\tau\}_{\tau \in \mathcal{T}^c}$ are given conditions that are kept constant. In
this definition, $m'$ plays the role of the policy that is used to sample the actions in the time
steps $\mathcal{T}$. Clearly, any realization of the divergence process $d_t(m^* \| m)$ can be decomposed
into a sum of sub-divergences, i.e.

$$d_t(m^* \| m) = \sum_{m'} g(m'; \mathcal{T}_{m'}), \tag{14}$$

where $\{\mathcal{T}_m\}_{m \in \mathcal{M}}$ forms a partition of $\mathcal{N}_t$. Figure 5 shows an example decomposition.
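The decomposition of a divergence process into sub-divergences can be made concrete with a small simulation. The Python sketch below is only an illustration: the two operation modes `A` and `B` of a 2-armed Bernoulli bandit, their biases and their deterministic policies are hypothetical. In every time step an operation mode is sampled from the posterior, its action is issued, the observation is generated by the true mode, and the increment of the divergence process is credited to the mode whose policy was used.

```python
import math
import random

# Hypothetical operation modes of a 2-armed Bernoulli bandit: each mode
# holds predicted reward biases per arm and the arm its policy pulls.
MODES = {
    "A": {"bias": (0.8, 0.2), "action": 0},
    "B": {"bias": (0.3, 0.6), "action": 1},
}


def obs_prob(mode, action, o):
    """P(o | mode, action) for a binary observation o."""
    b = MODES[mode]["bias"][action]
    return b if o == 1 else 1.0 - b


def sample_mode(log_post, rng):
    """Draw an operation mode from the (unnormalized, log-space) posterior."""
    top = max(log_post.values())
    weights = {m: math.exp(v - top) for m, v in log_post.items()}
    u = rng.random() * sum(weights.values())
    acc = 0.0
    for m, w in weights.items():
        acc += w
        if u <= acc:
            return m
    return m  # numerical safety net


def run(true_mode="A", other="B", steps=200, seed=1):
    rng = random.Random(seed)
    log_post = {m: math.log(0.5) for m in MODES}
    d_total = 0.0                  # d_t(m*||m) with m* = true_mode, m = other
    g = {m: 0.0 for m in MODES}    # sub-divergences g(m'; T_m')
    T = {m: [] for m in MODES}     # time steps in which m' supplied the action
    for tau in range(1, steps + 1):
        m_tau = sample_mode(log_post, rng)
        a = MODES[m_tau]["action"]
        o = 1 if rng.random() < MODES[true_mode]["bias"][a] else 0
        inc = math.log(obs_prob(true_mode, a, o) / obs_prob(other, a, o))
        d_total += inc
        g[m_tau] += inc
        T[m_tau].append(tau)
        # Only observations update the posterior; the agent's own actions
        # are interventions and carry no evidence.
        for m in MODES:
            log_post[m] += math.log(obs_prob(m, a, o))
    return d_total, g, T


d_total, g, T = run()
print(d_total, g, {m: len(steps) for m, steps in T.items()})
```

The index sets `T["A"]` and `T["B"]` partition the time steps, and the two sub-divergences add up to the full divergence process, as stated in Equation (14).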




Figure 5: Decomposition of a divergence process (1) into sub-divergences (2 & 3).

The averages of sub-divergences will play an important role in the analysis. Define the
average over all realizations of $g(m'; \mathcal{T})$ as

$$G(m'; \mathcal{T}) := \sum_{(ao_\tau)_{\tau \in \mathcal{T}}} P_{m'}\bigl(\{ao_\tau\}_{\tau \in \mathcal{T}} \mid \{ao_\tau\}_{\tau \in \mathcal{T}^c}\bigr)\, g(m'; \mathcal{T}).$$

Notice that for any $\tau \in \mathcal{N}_t$,

$$G(m'; \{\tau\}) = \sum_{ao_\tau} P(a_\tau \mid m', ao_{<\tau})\, P(o_\tau \mid m^*, ao_{<\tau}a_\tau) \ln \frac{P(o_\tau \mid m^*, ao_{<\tau}a_\tau)}{P(o_\tau \mid m, ao_{<\tau}a_\tau)} \ge 0,$$

because of Gibbs' inequality. In particular,

$$G(m^*; \{\tau\}) = 0.$$

Clearly, this holds as well for any $\mathcal{T} \subset \mathcal{N}_t$:

$$\forall m' \quad G(m'; \mathcal{T}) \ge 0, \qquad G(m^*; \mathcal{T}) = 0. \tag{15}$$

4.3 Boundedness



In general, a divergence process is very complex: virtually all the classes of distributions 
that are of interest in control go well beyond the assumptions of i.i.d. and stationarity. This 
increased complexity can jeopardize the analytic tractability of the divergence process, such 
that no predictions about its asymptotic behavior can be made anymore. More specifically, 
if the growth rates of the divergence processes vary too much from realization to realiza- 
tion, then the posterior distribution over operation modes can vary qualitatively between 
realizations. Hence, one needs to impose a stability requirement akin to ergodicity to limit 
the class of possible divergence-processes to a class that is analytically tractable. For this 
purpose the following property is introduced. 

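A divergence process $d_t(m^* \| m)$ is said to be bounded in $\mathcal{M}$ iff for any $\delta > 0$, there is a
$C > 0$, such that for all $m' \in \mathcal{M}$, all $t$ and all $\mathcal{T} \subset \mathcal{N}_t$

$$\bigl| g(m'; \mathcal{T}) - G(m'; \mathcal{T}) \bigr| \le C$$

with probability $\ge 1 - \delta$.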

Figure 6 illustrates this property. Boundedness is the key property that is going to be
used to construct the results of this section. The first important result is that the posterior
probability of the true input-output model is bounded from below.

Theorem 2 Let the set of operation modes of a controller be such that for all $m \in \mathcal{M}$ the
divergence process $d_t(m^* \| m)$ is bounded. Then, for any $\delta > 0$, there is a $\lambda > 0$, such that
for all $t \in \mathbb{N}$,

$$P(m^* \mid ao_{\le t}) \ge \frac{\lambda}{|\mathcal{M}|}$$

with probability $\ge 1 - \delta$.

Figure 6: If a divergence process is bounded, then the realizations (curves 2 & 3) of a
sub-divergence stay within a band around the mean (curve 1).



4.4 Core 

If one wants to identify the operation modes whose posterior probabilities vanish, then it
is not enough to characterize them as those modes whose hypothesis does not match the
true hypothesis. Figure 7 illustrates this problem. Here, three hypotheses along with their
associated policies are shown. $H_1$ and $H_2$ share the prediction made for region A but differ
in region B. Hypothesis $H_3$ differs everywhere from the others. Assume $H_1$ is true. As long
as we apply policy $P_2$, hypothesis $H_3$ will make wrong predictions and thus its divergence
process will diverge as expected. However, no evidence against $H_2$ will be accumulated. It
is only when one applies policy $P_1$ for long enough time that the controller will eventually
enter region B and hence accumulate counter-evidence for $H_2$.



Figure 7: If hypothesis $H_1$ is true and agrees with $H_2$ on region A, then policy $P_2$ cannot
disambiguate the three hypotheses.

But what does "long enough" mean? If $P_1$ is executed only for a short period, then the
controller risks not visiting the disambiguating region. But unfortunately, neither the right 
policy nor the right length of the period to run it are known beforehand. Hence, an agent 
needs a clever time-allocating strategy to test all policies for all finite time intervals. This 
motivates the following definition. 

The core of an operation mode $m^*$, denoted as $[m^*]$, is the subset of $\mathcal{M}$ containing
operation modes behaving like $m^*$ under its policy. More formally, an operation mode
$m \notin [m^*]$ (i.e. is not in the core) iff for any $C > 0$, $\delta, \xi > 0$, there is a $t_0 \in \mathbb{N}$, such that for
all $t \ge t_0$,

$$G(m^*; \mathcal{T}) \ge C$$

with probability $\ge 1 - \delta$, where $G(m^*; \mathcal{T})$ is a sub-divergence of $d_t(m^* \| m)$, and
$\Pr\{\tau \in \mathcal{T}\} \ge \xi$ for all $\tau \in \mathcal{N}_t$.

In other words, if the agent was to apply $m^*$'s policy in each time step with probability at
least $\xi$, and under this strategy the expected sub-divergence $G(m^*; \mathcal{T})$ of $d_t(m^* \| m)$ grows
unboundedly, then $m$ is not in the core of $m^*$. Note that demanding a strictly positive
probability of execution in each time step guarantees that the agent will run $m^*$ for all
possible finite time-intervals. As the following theorem shows, the posterior probabilities of
the operation modes that are not in the core vanish almost surely.

Theorem 3 Let the set of operation modes of an agent be such that for all $m \in \mathcal{M}$ the
divergence process $d_t(m^* \| m)$ is bounded. If $m \notin [m^*]$, then $P(m \mid ao_{\le t}) \to 0$ as $t \to \infty$
almost surely.

4.5 Consistency 

Even if an operation mode $m$ is in the core of $m^*$, i.e. given that $m$ is essentially
indistinguishable from $m^*$ under $m^*$'s control, it can still happen that $m^*$ and $m$ have different
policies. Figure 8 shows an example of this. The hypotheses $H_1$ and $H_2$ share region A but
differ in region B. In addition, both operation modes have their policies $P_1$ and $P_2$ respectively
confined to region A. Note that both operation modes are in the core of each other.
However, their policies are different. This means that it is unclear whether multiplexing the
policies in time will ever disambiguate the two hypotheses. This is undesirable, as it could
impede the convergence to the right control law.



Figure 8: An example of inconsistent policies. Both operation modes are in the core of each
other, but have different policies.

Thus, it is clear that one needs to impose further restrictions on the mapping of hypotheses
into policies. With respect to Figure 8 one can make the following observations:

1. Both operation modes have policies that select subsets of region A. Therefore, the 
dynamics in A are preferred over the dynamics in B. 

2. Knowing that the dynamics in A are preferred over the dynamics in B allows us to 
drop region B from the analysis when choosing a policy. 

3. Since both hypotheses agree in region A, they have to choose the same policy in order
to be consistent in their selection criterion.

This motivates the following definition. An operation mode $m$ is said to be consistent
with $m^*$ iff $m \in [m^*]$ implies that for all $\varepsilon > 0$, there is a $t_0$, such that for all $t \ge t_0$ and all
$ao_{<t}a_t$,

$$\bigl| P(a_t \mid m, ao_{<t}) - P(a_t \mid m^*, ao_{<t}) \bigr| < \varepsilon.$$



In other words, if $m$ is in the core of $m^*$, then $m$'s policy has to converge to $m^*$'s policy.
The following theorem shows that consistency is a sufficient condition for convergence to
the right control law.

Theorem 4 Let the set of operation modes of an agent be such that: for all $m \in \mathcal{M}$ the
divergence process $d_t(m^* \| m)$ is bounded; and for all $m, m' \in \mathcal{M}$, $m$ is consistent with $m'$.
Then,

$$P(a_t \mid ao_{<t}) \to P(a_t \mid m^*, ao_{<t})$$

almost surely as $t \to \infty$.

4.6 Summary

In this section, a proof of convergence of the Bayesian control rule to the true operation
mode has been provided for a finite set of operation modes. For this convergence result to
hold, two conditions are assumed: boundedness and consistency. The first one,
boundedness, imposes the stability of divergence processes under the partial influence of the
policies contained within the set of operation modes. This condition can be regarded as
an ergodicity assumption. The second one, consistency, requires that if a hypothesis makes
the same predictions as another hypothesis within its most relevant subset of dynamics,
then both hypotheses share the same policy. This relevance is formalized as the core of an
operation mode. The concepts and proof strategies strengthen the intuition about potential
pitfalls that arise in the context of controller design. In particular we could show that
the asymptotic analysis can be recast as the study of concurrent divergence processes that
determine the evolution of the posterior probabilities over operation modes, thus abstracting
away from the details of the classes of I/O distributions. The extension of these results to
infinite sets of operation modes is left for future work. For example, one could think of
partitioning a continuous space of operation modes into "essentially different" regions where
representative operation modes subsume their neighborhoods (Grünwald, 2007).



5. Examples 

5.1 Bandit Problems 



Consider the multi-armed bandit problem (Robbins, 1952). The problem is stated as follows.
Suppose there is an $N$-armed bandit, i.e. a slot-machine with $N$ levers. When pulled, lever
$i$ provides a reward drawn from a Bernoulli distribution with a bias $h_i$ specific to that lever.
That is, a reward $r = 1$ is obtained with probability $h_i$ and a reward $r = 0$ with probability
$1 - h_i$. The objective of the game is to maximize the time-averaged reward through iterative
pulls. There is a continuum range of stationary strategies, each one parameterized by $N$
probabilities $\{s_i\}_{i=1}^{N}$ indicating the probabilities of pulling each lever. The difficulty arising
in the bandit problem is to balance reward maximization based on the knowledge already
acquired with attempting new actions to further improve knowledge. This dilemma is known
as the exploration versus exploitation tradeoff (Sutton and Barto, 1998).

This is an ideal task for the Bayesian control rule, because each possible bandit has a
known optimal agent. Indeed, a bandit can be represented by an $N$-dimensional bias vector
$m = [m_1, \ldots, m_N] \in \mathcal{M} = [0; 1]^N$. Given such a bandit, the optimal policy consists in
pulling the lever with the highest bias. That is, an operation mode is given by:

$$P(o_t = 1 \mid m, a_t = i) = m_i, \qquad P(a_t = i \mid m) = \begin{cases} 1 & \text{if } i = \arg\max_j \{m_j\}, \\ 0 & \text{else.} \end{cases}$$

Figure 9: The space of bandit configurations can be partitioned into $N$ regions according
to the optimal lever. Panels a and b show the 2-armed and 3-armed bandit cases
respectively.

To apply the Bayesian control rule, it is necessary to fix a prior distribution over the
bandit configurations. Assuming a uniform distribution, the Bayesian control rule is

$$P(a_{t+1} = i \mid ao_{\le t}) = \int_{\mathcal{M}} P(a_{t+1} = i \mid m)\, P(m \mid ao_{\le t})\, dm$$

with the update rule given by

$$P(m \mid ao_{\le t}) = \frac{P(m) \prod_{\tau=1}^{t} P(o_\tau \mid m, a_\tau)}{\int_{\mathcal{M}} P(m') \prod_{\tau=1}^{t} P(o_\tau \mid m', a_\tau)\, dm'} = \prod_{j=1}^{N} \frac{m_j^{r_j} (1 - m_j)^{f_j}}{B(r_j + 1, f_j + 1)}$$

where $r_j$ and $f_j$ are the counts of the number of times a reward has been obtained from
pulling lever $j$ and the number of times no reward was obtained respectively. Observe that
here the summation over discrete operation modes has been replaced by an integral over
the continuous space of configurations. In the last expression we see that the posterior
distribution over the lever biases is given by a product of $N$ Beta distributions. Thus,
sampling an action amounts to first sampling an operation mode $m$ by obtaining each bias
$m_j$ from a Beta distribution with parameters $r_j + 1$ and $f_j + 1$, and then choosing the action
corresponding to the highest bias $i = \arg\max_j m_j$.

Figure 10: Comparison in the $N$-armed bandit problem of the Bayesian control rule (solid
line), an $\epsilon$-greedy agent (dashed line) and using Gittins indices (dotted line).
1,000 runs have been averaged. The top panel shows the evolution of the average
reward. The bottom panel shows the evolution of the percentage of times the
best lever was pulled.

Simulation: The Bayesian control rule described above has been compared against two
other agents: an $\epsilon$-greedy strategy with decay (on-line) and Gittins indices (off-line). The
test bed consisted of bandits with $N = 10$ levers whose biases were drawn uniformly at the
beginning of each run. Every agent had to play 1000 runs for 1000 time steps each. Then,
the performance curves of the individual runs were averaged. The $\epsilon$-greedy strategy selects
a random action with a small probability given by eoT 1 and otherwise plays the lever with 
highest expected reward. The parameters have been determined empirically to the values 
e = 0.1, a = 0.99 and r = 0.7 after several test runs. They have been adjusted in a way 
to maximize the average performance in the last trials of our simulations. For the Gittins 
method, all the indices were computed up to horizon 1300 using a geometric discounting 
of $\alpha = 0.999$, i.e. close to one to approximate the time-averaged reward. The results are
shown in Figure 10.

It is seen that the $\epsilon$-greedy strategy quickly reaches an acceptable level of performance,
but then seems to stall at a significantly suboptimal level, pulling the optimal lever only
60% of the time. This can be improved by using a value for $\epsilon$ that decays over time. In
contrast, both the Gittins strategy and the Bayesian control rule show essentially the same
asymptotic performance, but differ in the initial transient phase where the Gittins strategy
significantly outperforms the Bayesian control rule. There are several observations
that are worth making here. First, Gittins indices have to be pre-computed off-line. The
time complexity scales quadratically with the horizon, and the computations for the horizon
of 1300 steps took several hours on our machines. In contrast, the Bayesian control rule
could be applied without pre-computation. Second, even though the Gittins method actively
issues the optimal information gathering actions while the Bayesian control rule passively
samples the actions from the posterior distribution over operation modes, in the end both
methods rely on the convergence of the underlying Bayesian estimator. This implies that
both methods have the same information bottleneck, since the Bayesian estimator requires
the same amount of information to converge. Thus, active information gathering actions
only affect the utility of the transient phase, not the permanent state. Other efficient
algorithms for bandit problems can be found in the literature (Auer et al., 2002).
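The action-sampling procedure for bandits described in this section (draw each bias from its Beta posterior, then pull the lever with the highest sampled bias) can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the reported simulations; the function name `bcr_bandit`, the three example biases and the seed are assumptions. The same scheme is commonly known as Thompson sampling or posterior sampling.

```python
import random


def bcr_bandit(biases, steps, seed=0):
    """Run the Bayesian control rule on a Bernoulli bandit.

    biases: true success probabilities of the levers (unknown to the agent).
    Returns the reward counts r, the no-reward counts f and the pulled levers.
    """
    rng = random.Random(seed)
    n = len(biases)
    r = [0] * n   # r_j: number of rewards obtained from lever j
    f = [0] * n   # f_j: number of pulls of lever j without reward
    pulls = []
    for _ in range(steps):
        # Sample an operation mode m: one bias per lever from Beta(r_j+1, f_j+1),
        # i.e. from the posterior under a uniform prior.
        m = [rng.betavariate(r[j] + 1, f[j] + 1) for j in range(n)]
        # Act according to the sampled mode: pull the lever with highest bias.
        i = max(range(n), key=lambda j: m[j])
        reward = 1 if rng.random() < biases[i] else 0
        if reward:
            r[i] += 1
        else:
            f[i] += 1
        pulls.append(i)
    return r, f, pulls


true_biases = [0.2, 0.5, 0.9]          # hypothetical bandit
r, f, pulls = bcr_bandit(true_biases, steps=2000)
best_fraction = pulls[-500:].count(2) / 500
print(r, f, best_fraction)
```

No pre-computation is required: the only state carried between time steps is the pair of count vectors `r` and `f`.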



5.2 Markov Decision Problems 

A Markov Decision Process (MDP) is defined as a tuple $(\mathcal{X}, \mathcal{A}, T, r)$: $\mathcal{X}$ is the state space;
$\mathcal{A}$ is the action space; $T_a(x; x') = \Pr(x' \mid a, x)$ is the probability that an action $a \in \mathcal{A}$
taken in state $x \in \mathcal{X}$ will lead to state $x' \in \mathcal{X}$; and $r(x, a) \in \mathcal{R} := \mathbb{R}$ is the immediate
reward obtained in state $x \in \mathcal{X}$ and action $a \in \mathcal{A}$. The interaction proceeds in time steps
$t = 1, 2, \ldots$ where at time $t$, action $a_t \in \mathcal{A}$ is issued in state $x_{t-1} \in \mathcal{X}$, leading to a reward
$r_t = r(x_{t-1}, a_t)$ and a new state $x_t$ that starts the next time step $t + 1$. A stationary
closed-loop control policy $\pi : \mathcal{X} \to \mathcal{A}$ assigns an action to each state. For MDPs there always
exists an optimal stationary deterministic policy and thus one only needs to consider such
policies. In undiscounted MDPs the average reward per time step for a fixed policy $\pi$ with
initial state $x$ is defined as $\rho^\pi(x) = \lim_{t \to \infty} E_\pi\bigl[\frac{1}{t} \sum_{\tau=0}^{t} r_\tau\bigr]$. It can be shown (Bertsekas,
1987) that $\rho^\pi(x) = \rho^\pi(x')$ for all $x, x' \in \mathcal{X}$ under the assumption that the Markov chain for
policy $\pi$ is ergodic. Here, we assume that the MDPs are ergodic for all stationary policies.
Following the Q-notation of Watkins (1989), the optimal policy $\pi^*$ can be characterized in
terms of the optimal average reward $\rho$ and the optimal relative Q-values $Q(x, a)$ for each
state-action pair $(x, a)$ that are solutions to the following system of non-linear equations
(Singh, 1994): for any state $x \in \mathcal{X}$ and action $a \in \mathcal{A}$,



$$Q(x, a) + \rho = r(x, a) + \sum_{x' \in \mathcal{X}} \Pr(x' \mid x, a) \Bigl[\max_{a'} Q(x', a')\Bigr] = r(x, a) + E_{x'}\Bigl[\max_{a'} Q(x', a') \Bigm| x, a\Bigr]. \tag{16}$$

The optimal policy can then be defined as $\pi^*(x) := \arg\max_a Q(x, a)$ for any state $x \in \mathcal{X}$.
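For a known MDP, the system (16) can be solved numerically. The Python sketch below uses relative value iteration, a standard technique that is swapped in here purely for illustration; it is not the inference procedure of the Bayesian control rule. The two-state, two-action MDP and the choice of reference pair `ref` are hypothetical, and convergence relies on the assumption that all transition probabilities are strictly positive.

```python
def relative_value_iteration(P, r, iters=2000, ref=(0, 0)):
    """Solve Q(x,a) + rho = r(x,a) + sum_x' P(x'|x,a) max_a' Q(x',a').

    P[x][a][y] is the probability of moving from x to y under action a,
    r[x][a] is the immediate reward. The relative Q-values are pinned
    by enforcing Q[ref] = 0. Returns (Q, rho).
    """
    n_states, n_actions = len(r), len(r[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    rho = 0.0
    for _ in range(iters):
        V = [max(Q[x]) for x in range(n_states)]
        backup = [
            [r[x][a] + sum(P[x][a][y] * V[y] for y in range(n_states))
             for a in range(n_actions)]
            for x in range(n_states)
        ]
        rho = backup[ref[0]][ref[1]]   # current estimate of the average reward
        Q = [[backup[x][a] - rho for a in range(n_actions)]
             for x in range(n_states)]
    return Q, rho


# Hypothetical MDP: state 1 is rewarding, action 1 tends to reach and keep it.
P = [
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0, 1
    [[0.7, 0.3], [0.1, 0.9]],   # transitions from state 1 under actions 0, 1
]
r = [[0.0, 0.0], [1.0, 1.0]]

Q, rho = relative_value_iteration(P, r)
policy = [max(range(2), key=lambda a: Q[x][a]) for x in range(2)]
print(rho, policy)
```

The returned pair `(Q, rho)` satisfies Equation (16) up to numerical precision, and the greedy policy with respect to `Q` is the optimal policy $\pi^*$ of this toy MDP.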

Again this setup allows for a straightforward solution with the Bayesian control rule,
because each possible MDP (characterized by the Q-values and the average reward) has
a known solution $\pi^*(x)$. Accordingly, the operation modes $m$ are given by $(Q_m, \rho_m)$. To
obtain a likelihood model for inference over $m$, we realize that equation (16) can be rewritten
such that it predicts the instantaneous reward $r(x, a)$ as the sum of a mean instantaneous
reward $\xi_m$ plus a noise term $\nu$ given the $Q_m$-values and the average reward $\rho_m$ for the MDP
labeled by $m$

$$r(x, a) = \underbrace{Q_m(x, a) + \rho_m - \max_{a'} Q_m(x', a')}_{\text{mean instantaneous reward } \xi_m(x, a, x')} + \underbrace{\max_{a'} Q_m(x', a') - E\bigl[\max_{a'} Q_m(x', a') \bigm| x, a\bigr]}_{\text{noise } \nu}$$

Assuming that $\nu$ can be reasonably approximated by a normal distribution $N(0, 1/p)$ with
precision $p$, we can write down a likelihood model for the immediate reward $r$ using the
Q-values and the average reward, i.e.

$$P(r \mid m, x, a, x') = \sqrt{\frac{p}{2\pi}} \exp\Bigl\{-\frac{p}{2} \bigl(r - \xi_m(x, a, x')\bigr)^2\Bigr\}. \tag{17}$$

In order to determine the intervention model for each operation mode, we can simply exploit
the above properties of the Q-values, which gives

$$P(a \mid m, x) = \begin{cases} 1 & \text{if } a = \arg\max_{a'} Q_m(x, a'), \\ 0 & \text{else.} \end{cases} \tag{18}$$
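Equations (17) and (18) translate directly into code. The following Python sketch is illustrative only: an operation mode is represented by a hypothetical pair of a nested list `Q` (indexed as `Q[x][a]`) and a scalar `rho`, and the function names are assumptions.

```python
import math


def mean_reward(Q, rho, x, a, x_next):
    """xi_m(x, a, x') = Q_m(x, a) + rho_m - max_a' Q_m(x', a')."""
    return Q[x][a] + rho - max(Q[x_next])


def reward_likelihood(r, Q, rho, x, a, x_next, precision=1.0):
    """Gaussian likelihood of Equation (17) with precision p."""
    xi = mean_reward(Q, rho, x, a, x_next)
    return math.sqrt(precision / (2.0 * math.pi)) * math.exp(
        -0.5 * precision * (r - xi) ** 2
    )


def intervention_action(Q, x):
    """Deterministic intervention model of Equation (18): the greedy action."""
    return max(range(len(Q[x])), key=lambda a: Q[x][a])


# Hypothetical operation mode with two states and two actions.
Q_m = [[0.0, 1.0], [2.0, 0.0]]
rho_m = 0.5
xi = mean_reward(Q_m, rho_m, x=0, a=1, x_next=1)
lik = reward_likelihood(-0.5, Q_m, rho_m, x=0, a=1, x_next=1)
print(xi, lik, intervention_action(Q_m, 0), intervention_action(Q_m, 1))
```

The likelihood peaks when the observed reward equals the mean instantaneous reward predicted by the operation mode, which is what allows the posterior over $(Q_m, \rho_m)$ to be updated from observed transitions.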



To apply the Bayesian control rule over the controllers $m$, the intervened posterior
distribution $P(m \mid a_{\le t}, x_{\le t})$ needs to be computed. Fortunately, due to the simplicity of the
likelihood model, one can easily devise a conjugate prior distribution and apply standard
inference methods (see Appendix A.5). Actions are again determined by sampling operation
modes from this posterior and executing the action suggested by the corresponding
intervention models. The resulting algorithm is very similar to Bayesian Q-learning (Dearden et al.,
1998, 1999), but differs in the way actions are selected.



[Figure 11, panels: (b) Bayesian control rule; (c) R-learning, C = 5; (d) R-learning, C = 30; (e) R-learning, C = 200. Horizontal axis: ×1000 time steps.]

Figure 11: Results for the 7x7 grid-world domain. Panel (a) illustrates the setup. Columns
(b)-(e) illustrate the behavioral statistics of the algorithms. The upper and lower
row have been calculated over the first and last 5,000 time steps of randomly
chosen runs. The probability of being in a state is color-encoded, and the arrows
represent the most frequent actions taken by the agents. Panel (f) presents the
curves obtained by averaging ten runs.



Simulation: We have tested our MDP-agent in a grid-world example. To give an intuition
of the achieved performance, the results are contrasted with those achieved by R-learning.
We have used the R-learning variant presented in Singh (1994, Algorithm 3) together with
the uncertainty exploration strategy (Mahadevan, 1996). The corresponding update
equations are

$$\begin{aligned} Q(x, a) &\leftarrow (1 - \alpha)\, Q(x, a) + \alpha \bigl(r - \rho + \max_{a'} Q(x', a')\bigr) \\ \rho &\leftarrow (1 - \beta)\, \rho + \beta \bigl(r + \max_{a'} Q(x', a') - Q(x, a)\bigr), \end{aligned} \tag{19}$$



where $\alpha, \beta > 0$ are learning rates. The exploration strategy chooses with fixed probability
$p_{\text{exp}} > 0$ the action $a$ that maximizes $Q(x, a) + \frac{C}{F(x, a)}$, where $C$ is a constant, and $F(x, a)$
represents the number of times that action $a$ has been tried in state $x$. Thus, higher values
of $C$ enforce increased exploration.

Algorithm              Average Reward
BCR                    0.3582 ± 0.0038
R-learning, C = 200    0.2314 ± 0.0024
R-learning, C = 30     0.3056 ± 0.0063
R-learning, C = 5      0.2049 ± 0.0012

Table 1: Average reward attained by the different algorithms at the end of the run. The
mean and the standard deviation have been calculated based on 10 runs.
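The R-learning update equations (19) and the count-based exploration rule can be sketched as follows. This Python fragment is a hedged illustration, not the implementation behind Table 1. Three details are assumptions made here because the text leaves them open: the $\rho$-update uses the value of $Q(x, a)$ from before the Q-update, the agent acts greedily whenever the exploration rule is not triggered, and the counts `F` start at 1 to avoid a division by zero.

```python
import random


def r_learning_step(Q, rho, x, a, r, x_next, alpha=0.5, beta=0.001):
    """Apply the updates of Equation (19) in place on Q; return the new rho."""
    target = max(Q[x_next])
    q_old = Q[x][a]   # assumption: rho is updated with the pre-update Q(x, a)
    Q[x][a] = (1.0 - alpha) * q_old + alpha * (r - rho + target)
    return (1.0 - beta) * rho + beta * (r + target - q_old)


def choose_action(Q, F, x, C, p_exp, rng):
    """With probability p_exp maximize Q(x,a) + C / F(x,a), else act greedily."""
    actions = range(len(Q[x]))
    if rng.random() < p_exp:
        return max(actions, key=lambda a: Q[x][a] + C / F[x][a])
    return max(actions, key=lambda a: Q[x][a])


# Hypothetical two-state, two-action example.
Q = [[0.0, 0.0], [0.0, 0.0]]
rho = r_learning_step(Q, 0.0, x=0, a=1, r=1.0, x_next=1)

rng = random.Random(0)
Q2 = [[1.0, 0.0]]
F2 = [[100, 1]]   # action 1 has rarely been tried in state 0
explore = choose_action(Q2, F2, 0, C=5.0, p_exp=1.0, rng=rng)
exploit = choose_action(Q2, F2, 0, C=5.0, p_exp=0.0, rng=rng)
print(Q, rho, explore, exploit)
```

The bonus $C / F(x, a)$ shrinks as an action is tried more often, so a larger `C` keeps rarely tried actions attractive for longer.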



In Mahadevan (1996), a grid-world is described that is especially useful as a test bed for
the analysis of RL algorithms. For our purposes, it is of particular interest because it is easy
to design experiments containing suboptimal limit-cycles. Figure 11, panel (a), illustrates
the 7x7 grid-world. A controller has to learn a policy that leads it from any initial location
to the goal state. At each step, the agent can move to any adjacent space (up, down, left 
or right). If the agent reaches the goal state then its next position is randomly set to any 
square of the grid (with uniform probability) to start another trial. There are also "one- 
way membranes" that allow the agent to move into one direction but not into the other. 
In these experiments, these membranes form "inverted cups" that the agent can enter from 
any side but can only leave through the bottom, playing the role of a local maximum. 
Transitions are stochastic: the agent moves to the correct square with probability p = yjj 
and to any of the free adjacent spaces (uniform distribution) with probability 1 — p = j^. 
Rewards are assigned as follows. The default reward is r = 0. If the agent traverses a 
membrane it obtains a reward of r = 1. Reaching the goal state assigns r = 2.5. The 
parameters chosen for this simulation were the following. For our MDP-agent, we have 
chosen hyperparameters $\mu_0 = 1$ and $\lambda_0 = 1$ and precision $p = 1$. For R-learning, we have
chosen learning rates $\alpha = 0.5$ and $\beta = 0.001$, and the exploration constant has been set to
$C = 5$, $C = 30$ and to $C = 200$. A total of 10 runs were carried out for each algorithm. The
results are presented in Figure 11 and Table 1. R-learning only learns the optimal policy
given sufficient exploration (panels c & d, bottom row), whereas the Bayesian control rule
learns the policy successfully. In Figure 11(f), the learning curve of R-learning for $C = 5$
and $C = 30$ is initially steeper than that of the Bayesian controller. However, the latter attains a
higher average reward from around time step 125,000 onwards. We attribute this shallow initial
transient to the phase where the distribution over the operation modes is flat, which is also
reflected by the initially random exploratory behavior.



6. Discussion 



The key idea of this work is to extend the minimum relative entropy principle, i.e. the
variational principle underlying Bayesian estimation, to the problem of adaptive control.
From a coding point of view, this work extends the idea of maximal compression of the
observation stream to the whole experience of the agent containing both the agent's actions
and observations. This not only minimizes the amount of bits to write when saving/encoding
the I/O stream, but it also minimizes the amount of bits required to produce/decode an
action (MacKay, 2003, Chapter 6).

This extension is non-trivial, because there is an important caveat for coding I/O
sequences: unlike observations, actions do not carry any information that could be used for
inference in adaptive coding because actions are issued by the decoder itself. The problem
is that doing inference on one's own actions is logically inconsistent and leads to paradoxes
(Nozick, 1969). This seemingly innocuous issue has turned out to be very intricate and
has been investigated intensely in the recent past by researchers focusing on the issue of
causality (Pearl, 2000; Spirtes et al., 2000; Dawid, 2010). Our work contributes to this body
of research by providing further evidence that actions cannot be treated using probability
calculus alone.

If the causal dependencies are carefully taken into account, then minimizing the relative 
entropy leads to a rule for adaptive control which has been called the Bayesian control rule. 
This rule allows combining a class of task-specific agents into an agent that is universal 
with respect to this class. The resulting control law is a simple stochastic control rule that 
is completely general and parameter-free. As the analysis in this paper shows, this control 
rule converges to the true control law under mild assumptions. 



6.1 Critical issues 

• Causality. Virtually every adaptive control method in the literature successfully treats
actions as conditionals over observation streams and never worries about causality.
Thus, why bother about interventions? In a decision-theoretic setup, the decision
maker chooses a policy $\pi^* \in \Pi$ maximizing the expected utility $U$ over the outcomes
$\omega \in \Omega$, i.e. $\pi^* := \arg\max_\pi E[U \mid \pi] = \arg\max_\pi \sum_\omega \Pr(\omega \mid \pi)\, U(\omega)$. "Choosing $\pi^*$" is formally
equivalent to choosing the Kronecker delta function $\delta_\pi^{\pi^*}$ as the probability distribution
over policies. In this case, the conditional probabilities $\Pr(\omega \mid \pi)$ and $\Pr(\omega \mid \hat{\pi})$ coincide,
since

$$\Pr(\omega, \pi) = \Pr(\pi) \Pr(\omega \mid \pi) = \delta_\pi^{\pi^*} \Pr(\omega \mid \pi) = \Pr(\omega, \hat{\pi}).$$

Hence, the formalization of actions as interventions and observations as conditions is
perfectly compatible with the decision-theoretic setup and in fact generalizes decision
variables to the status of intervened random variables.

• Where do prior probabilities/likelihood models/policies come from? The predictor in
the Bayesian control rule is essentially a Bayesian predictor and thereby entails
(almost) the same modeling paradigm. The designer has to define a class of hypotheses
over the environments, construct appropriate likelihood models, and choose a suitable
prior probability distribution to capture the model's uncertainty. Similarly, under
sufficient domain knowledge, an analogous procedure can be applied to construct suitable
operation modes. However, there are many situations where this is a difficult or even
intractable problem in itself. For example, one can design a class of operation modes
by pre-computing the optimal policies for a given class of environments. Formally, let
$\Theta$ be a class of hypotheses modeling environments and let $\Pi$ be a class of policies. Given
a utility criterion $U$, define the set of operation modes $\mathcal{M} := \{m_\theta\}_{\theta \in \Theta}$ by constructing
each operation mode as $m_\theta := (\theta, \pi^*)$, $\pi^* \in \Pi$, where $\pi^* := \arg\max_\pi E[U \mid \theta, \pi]$.
However, computing the optimal policy $\pi^*$ is in many cases intractable. In some
cases, this can be remedied by characterizing the operation modes through optimality
equations which are solved by probabilistic inference as in the example of the MDP
agent in Section 5.2. Recently, we have applied a similar approach to adaptive control
problems with linear quadratic regulators (Ortega and Braun).



Problems of Bayesian methods. The Bayesian control rule treats an adaptive control 
problem as a Bayesian inference problem. Hence, all the problems typically associated 
with Bayesian methods carry over to agents constructed with the Bayesian control 
rule. These problems are of both analytical and computational nature. For example, 
there are many probabilistic models where the posterior distribution does not have a 
closed-form solution. Also, exact probabilistic inference is in general computationally 
very intensive. Even though there is a large literature in efficient/approximate inference
algorithms for particular problem classes (Bishop, 2006), not many of them are
suitable for on-line probabilistic inference in more realistic environment classes.

• Bayesian control rule versus Bayes-optimal control. Directly maximizing the (subjective)
expected utility for a given environment class is not the same as minimizing the
expected relative entropy for a given class of operation modes. As such, the Bayesian
control rule is not a Bayes-optimal controller. Indeed, it is easy to design experiments
where the Bayesian control rule converges to the maximum utility exponentially more
slowly than a Bayes-optimal controller, or does not converge at all. Consider the following
simple example: Environment 1 is a $k$-state MDP in which only $k$ consecutive actions
$A$ reach a state with reward +1. Any interruption by a $B$-action leads back to the
initial state. Consider a second environment which is like the first but with actions $A$ and
$B$ interchanged. A Bayes-optimal controller figures out the true environment in $k$
actions (either $k$ consecutive $A$'s or $B$'s). Consider now the Bayesian control rule: the
optimal action in Environment 1 is $A$, in Environment 2 it is $B$. A uniform $(\frac{1}{2}, \frac{1}{2})$ prior
over the operation modes stays a uniform posterior as long as no reward has been
observed. Hence the Bayesian control rule chooses $A$ and $B$ with equal probability
at each time step. With this policy it takes about $2^k$ actions to accidentally choose a
row of $A$'s (or $B$'s) of length $k$. From then on the Bayesian control rule is optimal
too. So a Bayes-optimal controller converges in time $k$, while the Bayesian control
rule needs exponentially longer. One way to remedy this problem might be to allow
the Bayesian control rule to sample actions from the same operation mode for several
time steps in a row rather than randomizing controllers in every cycle. However, if
one considers non-stationary environments this strategy can also break down. Consider,
for example, an increasing MDP with $k = \lceil 10\sqrt{t} \rceil$, in which a Bayes-optimal
controller converges in 100 steps, while the Bayesian control rule does not converge
at all in most realizations, because the boundedness assumption is violated.
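
The slowdown in the two-environment example can be checked numerically. The following is a minimal sketch rather than part of the original analysis: it assumes that the posterior stays uniform until a run of $k$ identical actions has been produced, so that every action is a fair draw from {A, B}. The function names `expected_steps_uniform` and `simulate_steps` are hypothetical.

```python
import random


def expected_steps_uniform(k):
    """Exact expected number of fair A/B draws until k identical actions occur in a row.

    Let E_j be the expected number of further draws given a current run of
    length j. Then E_k = 0 and E_j = 1 + E_{j+1}/2 + E_1/2. Writing
    E_j = a_j + b_j * E_1 and recursing backwards from j = k yields E_1.
    """
    a, b = 0.0, 0.0  # coefficients of E_k = 0
    for _ in range(k - 1):
        a, b = 1.0 + 0.5 * a, 0.5 * b + 0.5
    e1 = a / (1.0 - b) if k > 1 else 0.0
    return 1.0 + e1  # the first draw always starts a run of length 1


def simulate_steps(k, rng):
    """Draw A/B uniformly until the last k actions are identical; return the step count."""
    run, last, steps = 0, None, 0
    while run < k:
        action = rng.choice("AB")
        run = run + 1 if action == last else 1
        last = action
        steps += 1
    return steps


if __name__ == "__main__":
    rng = random.Random(0)
    for k in (2, 4, 6):
        mean = sum(simulate_steps(k, rng) for _ in range(2000)) / 2000
        print(k, expected_steps_uniform(k), round(mean, 2))
```

Under these assumptions the exact expectation is $2^k - 1$, in line with the "about $2^k$ actions" stated in the example, whereas a controller that commits to a single action needs only $k$ steps.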

6.2 Relation to existing approaches 

Some of the ideas underlying this work are not unique to the Bayesian control rule. The 
following is a selection of previously published work in the recent Bayesian reinforcement 
learning literature where related ideas can be found.

• Compression principles. In the literature, there is an important amount of work
relating compression to intelligence (MacKay, 2003; Hutter, 2004a). In particular, it
has even been proposed that compression ratio is an objective quantitative measure of
intelligence (Mahoney, 1999). Compression has also been used as a basis for a theory
of curiosity, creativity and beauty (Schmidhuber, 2009).



• Mixture of experts. Passive sequence prediction by mixing experts has been studied
extensively in the literature (Cesa-Bianchi and Lugosi, 2006). In (Hutter, 2001b),
Bayes-optimal predictors are mixed. Bayes-mixtures can also be used for universal
prediction (Hutter, 2003). For the control case, the idea of using mixtures of
expert-controllers has been previously evoked in models like the MOSAIC-architecture
(Haruno et al., 2001). Universal learning with Bayes mixtures of experts in reactive
environments has been studied in (Poland and Hutter, 2005; Hutter, 2002).



• Stochastic action selection. Other stochastic action selection approaches are found
in Wyatt (1997), who examines exploration strategies for (PO)MDPs, in learning
automata (Narendra and Thathachar, 1974) and in probability matching (R.O. Duda,
2001) amongst others. In particular, Wyatt (1997) discusses theoretical properties of
an extension to probability matching in the context of multi-armed bandit problems.
There, it is proposed to choose a lever according to how likely it is to be optimal and
it is shown that this strategy converges, thus providing a simple method for guiding
exploration.

• Relative entropy criterion. The usage of a minimum relative entropy criterion to
derive control laws underlies the KL-control methods developed in Todorov (2006,
2009); Kappen et al. (2009). There, it has been shown that a large class of optimal
control problems can be solved very efficiently if the problem statement is reformulated
as the minimization of the deviation of the dynamics of a controlled system from the
uncontrolled system. A related idea is to conceptualize planning as an inference
problem (Toussaint et al., 2006). This approach is based on an equivalence between
maximization of the expected future return and likelihood maximization which is
applicable to both MDPs and POMDPs. Algorithms based on this duality have become an
active field of current research. See for example Rasmussen and Deisenroth (2008),
where very fast model-based RL techniques are used for control in continuous state
and action spaces.
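
The probability-matching idea mentioned under stochastic action selection, namely choosing a lever according to how likely it is to be optimal, can be illustrated with a small sketch. This is a generic illustration and not the algorithm analyzed in the cited work: it assumes Bernoulli rewards with independent Beta(1, 1) priors, and the name `probability_matching_bandit` is hypothetical. Drawing one posterior sample per lever and pulling the lever with the largest sample selects each lever with its posterior probability of being the best one.

```python
import random


def probability_matching_bandit(true_means, steps, rng):
    """Bernoulli bandit in which each lever is pulled with its posterior probability of being optimal."""
    n_arms = len(true_means)
    wins = [1] * n_arms    # Beta(1, 1) prior: pseudo-count of rewards
    losses = [1] * n_arms  # Beta(1, 1) prior: pseudo-count of non-rewards
    pulls = [0] * n_arms
    for _ in range(steps):
        # One posterior sample per lever; the arg max is a sample of "the optimal lever".
        samples = [rng.betavariate(wins[i], losses[i]) for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    print(probability_matching_bandit([0.2, 0.5, 0.8], 2000, random.Random(0)))
```

As the posteriors sharpen, the pull counts are expected to concentrate on the lever with the highest mean, which is the convergence behavior the text refers to.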



7. Conclusions 

This work introduces the Bayesian control rule, a Bayesian rule for adaptive control. The
key feature of this rule is the special treatment of actions based on causal calculus and the
decomposition of an adaptive agent into a mixture of operation modes, i.e. environment-specific
agents. The rule is derived by minimizing the expected relative entropy from the
true operation mode and by carefully distinguishing between actions and observations. The
Bayesian control rule turns out to be exactly the predictive distribution over
the next action given the past interactions that one would obtain by using only probability
and causal calculus. Furthermore, it is shown that agents constructed with the Bayesian
control rule converge to the true operation mode under mild assumptions: boundedness,
which is related to ergodicity; and consistency, demanding that two indistinguishable
hypotheses share the same policy.

We have presented the Bayesian control rule as a way to solve adaptive control problems 
based on a minimum relative entropy principle. Thus, the Bayesian control rule can either 
be regarded as a new principled approach to adaptive control under a novel optimality 
criterion or as a heuristic approximation to traditional Bayes-optimal control. Since it 
takes on a similar form to Bayes' rule, the adaptive control problem could then be translated 
into an on-line inference problem where actions are sampled stochastically from a posterior 
distribution. It is important to note, however, that the problem statement as formulated 
here and the usual Bayes-optimal approach in adaptive control are not the same. In the 
future the relationship between these two problem statements deserves further investigation.

Appendix A. Proofs

A.1 Proof of Theorem 1

Proof. The proof follows the same line of argument as the solution to Equation [3] with
the crucial difference that actions are treated as interventions. Consider without loss of
generality the summand $\sum_m P(m) C_m^{a_t}$ in Equation [?]. Note that the relative entropy can be
written as a difference of two logarithms, where only one term depends on $\mathrm{Pr}$ to be varied.
Therefore, one can integrate out the other term and write it as a constant $c$. This yields

$$c - \sum_m P(m) \sum_{\hat{a}o_{<t}} P(\hat{a}o_{<t} \mid m) \sum_{a_t} P(a_t \mid m, \hat{a}o_{<t}) \ln \mathrm{Pr}(a_t \mid ao_{<t}).$$

Substituting $P(\hat{a}o_{<t} \mid m)$ by $P(m \mid \hat{a}o_{<t}) P(\hat{a}o_{<t}) / P(m)$ using Bayes' rule and further
rearrangement of the terms leads to

$$\begin{aligned}
&= c - \sum_m \sum_{\hat{a}o_{<t}} P(m \mid \hat{a}o_{<t}) P(\hat{a}o_{<t}) \sum_{a_t} P(a_t \mid m, \hat{a}o_{<t}) \ln \mathrm{Pr}(a_t \mid ao_{<t}) \\
&= c - \sum_{\hat{a}o_{<t}} P(\hat{a}o_{<t}) \sum_{a_t} P(a_t \mid \hat{a}o_{<t}) \ln \mathrm{Pr}(a_t \mid ao_{<t}).
\end{aligned}$$

The inner sum has the form $-\sum_x p(x) \ln q(x)$, i.e. the cross-entropy between $q(x)$ and
$p(x)$, which is minimized when $q(x) = p(x)$ for all $x$. By choosing this optimum one obtains
$\mathrm{Pr}(a_t \mid ao_{<t}) = P(a_t \mid \hat{a}o_{<t})$ for all $a_t$. Note that the solution to this variational problem is
independent of the weighting $P(\hat{a}o_{<t})$. Since the same argument applies to any summand
$\sum_m P(m) C_m^{o_\tau}$ and $\sum_m P(m) C_m^{a_\tau}$ in Equation [?], their variational problems are mutually
independent. Hence,

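$$\mathrm{Pr}(a_t \mid ao_{<t}) = P(a_t \mid \hat{a}o_{<t}), \qquad \mathrm{Pr}(o_t \mid ao_{<t} a_t) = P(o_t \mid \hat{a}o_{<t} \hat{a}_t)$$

for all $ao_{<t} \in \mathcal{Z}^*$. For $P(a_t \mid \hat{a}o_{<t})$, introduce the variable $m$ via a marginalization and then
apply the chain rule:

$$P(a_t \mid \hat{a}o_{<t}) = \sum_m P(a_t \mid m, \hat{a}o_{<t}) P(m \mid \hat{a}o_{<t}).$$

The term $P(m \mid \hat{a}o_{<t})$ can be further developed as

$$\begin{aligned}
P(m \mid \hat{a}o_{<t}) &= \frac{P(\hat{a}o_{<t} \mid m) P(m)}{\sum_{m'} P(\hat{a}o_{<t} \mid m') P(m')} \\
&= \frac{P(m) \prod_{\tau=1}^{t-1} P(\hat{a}_\tau \mid m, \hat{a}o_{<\tau}) P(o_\tau \mid m, \hat{a}o_{<\tau} \hat{a}_\tau)}{\sum_{m'} P(m') \prod_{\tau=1}^{t-1} P(\hat{a}_\tau \mid m', \hat{a}o_{<\tau}) P(o_\tau \mid m', \hat{a}o_{<\tau} \hat{a}_\tau)} \\
&= \frac{P(m) \prod_{\tau=1}^{t-1} P(o_\tau \mid m, ao_{<\tau} a_\tau)}{\sum_{m'} P(m') \prod_{\tau=1}^{t-1} P(o_\tau \mid m', ao_{<\tau} a_\tau)}.
\end{aligned}$$

The first equality is obtained by applying Bayes' rule and the second by using the chain
rule for probabilities, i.e. the causal factorization of the joint probability distribution.
To get the last equality, one applies the interventions to the
causal factorization. Thus, $P(\hat{a}_\tau \mid m, \hat{a}o_{<\tau}) = 1$ and $P(o_\tau \mid m, \hat{a}o_{<\tau} \hat{a}_\tau) = P(o_\tau \mid m, ao_{<\tau} a_\tau)$.
The equations characterizing $P(o_t \mid \hat{a}o_{<t} \hat{a}_t)$ are obtained similarly. ■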
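
The result of this proof can be read as a simple update loop. The sketch below is illustrative only and assumes a finite set of operation modes; the function names, the list-based posterior and the dictionary-based policies are hypothetical choices. The point it shows is that the posterior over operation modes is multiplied by observation likelihoods only, because the agent's own actions, treated as interventions, contribute a factor of 1 for every mode.

```python
import random


def update_posterior(weights, obs_likelihoods):
    """One posterior step over operation modes.

    weights[m] is the current P(m | past) and obs_likelihoods[m] is
    P(o_t | m, past, a_t). The probability of the agent's own action is
    deliberately absent: as an intervention it equals 1 for every mode.
    """
    unnormalized = [w * l for w, l in zip(weights, obs_likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def sample_action(weights, policies, rng):
    """Sample a mode from the posterior, then an action from that mode's policy.

    policies[m] maps each action to P(a_t | m, past); the two-stage sampling
    realizes the mixture sum_m P(a_t | m, past) P(m | past).
    """
    mode = rng.choices(range(len(weights)), weights=weights)[0]
    actions, probs = zip(*policies[mode].items())
    return rng.choices(actions, weights=probs)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    posterior = [0.5, 0.5]
    policies = [{"A": 1.0}, {"B": 1.0}]
    action = sample_action(posterior, policies, rng)
    # Hypothetical likelihoods of the resulting observation under each mode.
    posterior = update_posterior(posterior, [0.9, 0.2])
    print(action, posterior)
```

Sampling a mode first and then acting according to it is one way to draw from the mixture; it also makes explicit that the action itself is never used as evidence about the mode.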



A.2 Proof of Theorem 2

Proof. As has been pointed out in (14), a particular realization of the divergence process
$d_t(m^* \| m)$ can be decomposed as

$$d_t(m^* \| m) = \sum_{m'} g_m(m'; T_{m'}),$$

where the $g_m(m'; T_{m'})$ are sub-divergences of $d_t(m^* \| m)$ and the $T_{m'}$ form a partition of $\mathcal{N}_t$.
However, since $d_t(m^* \| m)$ is bounded in $\mathcal{M}$, one has that for all $\delta' > 0$, there is a $C(m) \geq 0$,
such that for all $m' \in \mathcal{M}$, all $t \in \mathbb{N}$ and all $T \subset \mathcal{N}_t$, the inequality

$$\left| g_m(m'; T_{m'}) - G_m(m'; T_{m'}) \right| \leq C(m)$$

holds with probability $\geq 1 - \delta'$. However, due to (15),

$$G_m(m'; T_{m'}) \geq 0$$

for all $m' \in \mathcal{M}$. Thus,

$$g_m(m'; T_{m'}) \geq -C(m).$$

If all the previous inequalities hold simultaneously then the divergence process can be
bounded as well. That is, the inequality

$$d_t(m^* \| m) \geq -M C(m) \tag{20}$$

holds with probability $\geq (1 - \delta')^M$ where $M := |\mathcal{M}|$. Choose

$$\beta(m) := \max\left\{0, \ln \frac{P(m)}{P(m^*)}\right\}.$$

Since $0 \geq \ln \frac{P(m)}{P(m^*)} - \beta(m)$, it can be added to the right hand side of (20). Using the
definition of $d_t(m^* \| m)$, taking the exponential and rearranging the terms one obtains

$$P(m^*) \prod_{\tau=1}^{t} P(o_\tau \mid m^*, \hat{a}o_{<\tau} \hat{a}_\tau) \geq e^{-\alpha(m)} P(m) \prod_{\tau=1}^{t} P(o_\tau \mid m, \hat{a}o_{<\tau} \hat{a}_\tau)$$

where $\alpha(m) := M C(m) + \beta(m) \geq 0$. Identifying the posterior probabilities of $m^*$ and $m$
by dividing both sides by the normalizing constant yields the inequality

$$P(m^* \mid \hat{a}o_{\leq t}) \geq e^{-\alpha(m)} P(m \mid \hat{a}o_{\leq t}).$$

This inequality holds simultaneously for all $m \in \mathcal{M}$ with probability $\geq (1 - \delta')^{M^2}$ and in
particular for $\lambda := \min_m \{e^{-\alpha(m)}\}$, that is,

$$P(m^* \mid \hat{a}o_{\leq t}) \geq \lambda P(m \mid \hat{a}o_{\leq t}).$$

But since this is valid for any $m \in \mathcal{M}$, and because $\max_m \{P(m \mid \hat{a}o_{\leq t})\} \geq \frac{1}{M}$, one gets

$$P(m^* \mid \hat{a}o_{\leq t}) \geq \frac{\lambda}{M},$$

with probability $\geq 1 - \delta$ for arbitrary $\delta > 0$ related to $\delta'$ through the equation
$\delta' := 1 - \sqrt[M^2]{1 - \delta}$. ■

A.3 Proof of Theorem 3

Proof. The divergence process $d_t(m^* \| m)$ can be decomposed into a sum of sub-divergences
(see Equation (14))

$$d_t(m^* \| m) = \sum_{m'} g(m'; T_{m'}). \tag{21}$$

Furthermore, for every $m' \in \mathcal{M}$, one has that for all $\delta' > 0$, there is a $C(m) \geq 0$, such that for
all $t \in \mathbb{N}$ and for all $T \subset \mathcal{N}_t$

$$\left| g(m'; T) - G(m'; T) \right| \leq C(m)$$

with probability $\geq 1 - \delta'$. Applying this bound to the summands in (21) yields the lower
bound

$$\sum_{m'} g(m'; T_{m'}) \geq \sum_{m'} \left( G(m'; T_{m'}) - C(m) \right)$$

which holds with probability $\geq (1 - \delta')^M$, where $M := |\mathcal{M}|$. Due to Inequality (15) one has
that for all $m' \neq m^*$, $G(m'; T_{m'}) \geq 0$. Hence,

$$\sum_{m'} \left( G(m'; T_{m'}) - C(m) \right) \geq G(m^*; T_{m^*}) - MC$$

where $C := \max_m \{C(m)\}$. The members of the set $T_{m^*}$ are determined stochastically; more
specifically, the $i$-th member is included into $T_{m^*}$ with probability $P(m^* \mid \hat{a}o_{<i})$. But since
$m \notin [m^*]$, one has that $G(m^*; T_{m^*}) \to \infty$ as $t \to \infty$ with probability $\geq 1 - \delta'$ for arbitrarily
chosen $\delta' > 0$. This implies that

$$\lim_{t \to \infty} d_t(m^* \| m) \geq \lim_{t \to \infty} G(m^*; T_{m^*}) - MC \nearrow \infty$$

with probability $\geq 1 - \delta$, where $\delta > 0$ is arbitrary and related to $\delta'$ as $\delta = 1 - (1 - \delta')^{M+1}$.
Using this result in the upper bound for posterior probabilities yields the final result

$$0 \leq \lim_{t \to \infty} P(m \mid \hat{a}o_{\leq t}) \leq \lim_{t \to \infty} \frac{P(m)}{P(m^*)} e^{-d_t(m^* \| m)} = 0.$$



A.4 Proof of Theorem 4

Proof. We will use the abbreviations $p_m(t) := P(a_t \mid m, \hat{a}o_{<t})$ and $w_m(t) := P(m \mid \hat{a}o_{<t})$.
Decompose $P(a_t \mid \hat{a}o_{<t})$ as

$$P(a_t \mid \hat{a}o_{<t}) = \sum_{m \notin [m^*]} p_m(t) w_m(t) + \sum_{m \in [m^*]} p_m(t) w_m(t). \tag{22}$$

The first sum on the right-hand side is lower-bounded by zero and upper-bounded by

$$\sum_{m \notin [m^*]} p_m(t) w_m(t) \leq \sum_{m \notin [m^*]} w_m(t)$$

because $p_m(t) \leq 1$. Due to Theorem 3, $w_m(t) \to 0$ as $t \to \infty$ almost surely for every $m \notin [m^*]$.
Given $\varepsilon' > 0$ and $\delta' > 0$, let $t_0(m)$ be the time such that for all $t \geq t_0(m)$, $w_m(t) < \varepsilon'/M$,
where $M := |\mathcal{M}|$. Choosing $t_0 := \max_m \{t_0(m)\}$, the previous inequality holds for all $m$ and
$t \geq t_0$ simultaneously with probability $\geq (1 - \delta')^M$. Hence,

$$\sum_{m \notin [m^*]} p_m(t) w_m(t) \leq \sum_{m \notin [m^*]} w_m(t) < \varepsilon'. \tag{23}$$

To bound the second sum in (22) one proceeds as follows. For every member $m \in [m^*]$,
one has that $p_m(t) \to p_{m^*}(t)$ as $t \to \infty$. Hence, following a similar construction as above,
one can choose $t_0'$ such that for all $t \geq t_0'$ and $m \in [m^*]$, the inequalities

$$\left| p_m(t) - p_{m^*}(t) \right| < \varepsilon'$$

hold simultaneously for the precision $\varepsilon' > 0$. Applying this to the second sum yields the
bounds

$$\sum_{m \in [m^*]} (p_{m^*}(t) - \varepsilon') w_m(t) \leq \sum_{m \in [m^*]} p_m(t) w_m(t) \leq \sum_{m \in [m^*]} (p_{m^*}(t) + \varepsilon') w_m(t).$$

Here $(p_{m^*}(t) \pm \varepsilon')$ are multiplicative constants that can be placed in front of the sum. Note
that

$$1 \geq \sum_{m \in [m^*]} w_m(t) = 1 - \sum_{m \notin [m^*]} w_m(t) > 1 - \varepsilon'.$$

Use of the above inequalities allows simplifying the lower and upper bounds respectively:

$$\begin{aligned}
(p_{m^*}(t) - \varepsilon') \sum_{m \in [m^*]} w_m(t) &> p_{m^*}(t)(1 - \varepsilon') - \varepsilon' \geq p_{m^*}(t) - 2\varepsilon', \\
(p_{m^*}(t) + \varepsilon') \sum_{m \in [m^*]} w_m(t) &\leq p_{m^*}(t) + \varepsilon' < p_{m^*}(t) + 2\varepsilon'.
\end{aligned} \tag{24}$$

Combining the inequalities (23) and (24) in (22) yields the final result:

$$\left| P(a_t \mid \hat{a}o_{<t}) - p_{m^*}(t) \right| < 3\varepsilon' = \varepsilon,$$

which holds with probability $\geq 1 - \delta$ for arbitrary $\delta > 0$ related to $\delta'$ as $\delta' = 1 - \sqrt[M]{1 - \delta}$
and arbitrary precision $\varepsilon$. ■



A.5 Gibbs Sampling Implementation for MDP agent

Inserting the likelihood given in Equation (17) into Equation (13) of the Bayesian control
rule, one obtains the following expression for the posterior

$$\begin{aligned}
P(m \mid \hat{a}_{\leq t}, o_{\leq t}) &= \frac{P(x' \mid m, x, a)\, P(r \mid m, x, a, x')\, P(m \mid \hat{a}_{<t}, o_{<t})}{\int_{\mathcal{M}'} P(x' \mid m', x, a)\, P(r \mid m', x, a, x')\, P(m' \mid \hat{a}_{<t}, o_{<t})\, dm'} \\
&= \frac{P(r \mid m, x, a, x')\, P(m \mid \hat{a}_{<t}, o_{<t})}{\int_{\mathcal{M}'} P(r \mid m', x, a, x')\, P(m' \mid \hat{a}_{<t}, o_{<t})\, dm'},
\end{aligned} \tag{25}$$

where we have replaced the sum by an integration over $\mathcal{M}'$, the finite-dimensional real space
containing only the average reward and the Q-values of the observed states, and where we
have simplified the term $P(x' \mid m, x, a)$ because it is constant for all $m' \in \mathcal{M}'$.

By inspection of Equation (25), one sees that $m$ encodes a set of independent normal
distributions over the immediate reward having means $\xi_m(x, a, x')$ indexed by triples
$(x, a, x') \in \mathcal{X} \times \mathcal{A} \times \mathcal{X}$. In other words, given $(x, a, x')$, the rewards are drawn from a normal
distribution with unknown mean $\xi_m(x, a, x')$ and known variance $\sigma^2$. The sufficient statistics
are given by $n(x, a, x')$, the number of times that the transition $x \to x'$ has occurred under action
$a$, and $\bar{r}(x, a, x')$, the mean of the rewards obtained in the same transition. The conjugate
prior distribution is well known and given by a normal distribution with hyperparameters
$\mu_0$ and $\lambda_0$:

$$P(\xi_m(x, a, x')) = N(\mu_0, 1/\lambda_0) = \sqrt{\frac{\lambda_0}{2\pi}} \exp\left\{ -\frac{\lambda_0}{2} \left( \xi_m(x, a, x') - \mu_0 \right)^2 \right\}. \tag{26}$$

The posterior distribution is given by

$$P(\xi_m(x, a, x') \mid \hat{a}_{<t}, o_{<t}) = N(\mu(x, a, x'), 1/\lambda(x, a, x'))$$

where the posterior hyperparameters are computed as

$$\begin{aligned}
\mu(x, a, x') &= \frac{\lambda_0 \mu_0 + p\, n(x, a, x')\, \bar{r}(x, a, x')}{\lambda_0 + p\, n(x, a, x')}, \\
\lambda(x, a, x') &= \lambda_0 + p\, n(x, a, x'),
\end{aligned} \tag{27}$$

with $p = 1/\sigma^2$ denoting the precision of the reward distribution.
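
The hyperparameter update in Equation (27) is the standard conjugate update for a normal mean with known precision. The sketch below restates it for a single transition $(x, a, x')$; the function names are hypothetical, and `p` is assumed to be the known reward precision (the inverse of the known variance).

```python
def normal_posterior(mu0, lam0, p, n, r_bar):
    """Posterior mean and precision of a normal mean with known precision p.

    mu0, lam0: prior mean and prior precision (hyperparameters);
    n: number of observed rewards for the transition; r_bar: their mean.
    """
    lam = lam0 + p * n
    mu = (lam0 * mu0 + p * n * r_bar) / lam
    return mu, lam


def update_one(mu, lam, p, r):
    """Incremental form: fold in a single reward r, reusing the batch formula with n = 1."""
    return normal_posterior(mu, lam, p, 1, r)


if __name__ == "__main__":
    # Batch update and two incremental updates give the same hyperparameters.
    print(normal_posterior(0.0, 1.0, 2.0, 2, 2.0))
    mu, lam = update_one(0.0, 1.0, 2.0, 1.0)
    print(update_one(mu, lam, 2.0, 3.0))
```

Because the update can be applied one reward at a time, an agent only needs to store the current pair of hyperparameters for each transition it has actually observed.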
Finally, the conjugate distribution of the parameter vector $m$ is simply the product

$$\begin{aligned}
P(m \mid \hat{a}_{<t}, o_{<t}) &= \prod_{x, a, x'} P(\xi_m(x, a, x') \mid \hat{a}_{<t}, o_{<t}) &\quad (28) \\
&\propto \exp\left\{ -\frac{1}{2} \sum_{x, a, x'} \lambda(x, a, x') \left( \xi_m(x, a, x') - \mu(x, a, x') \right)^2 \right\} &\quad (29)
\end{aligned}$$

because the $\xi_m(x, a, x')$ are independent but at the same time functions of $m$. Thus, the
MDP agent is fully specified by the action probabilities in Equation (18), the likelihood
model in Equation (17), and the prior distribution (26).

Inference can be carried out by sampling $m$ from the posterior distribution in Equation (28).
The actions issued by the agent are by-products of the inference process. Here
we derive an approximate Gibbs sampler for $m$. We introduce the following symbols: $m^{-\rho}$
and $m^{-Q(x,a)}$ stand for the parameter set removing $\rho$ and $Q(x, a)$ respectively; $\mu$ and $\lambda$
are matrices collecting the values of the posterior hyperparameters $\mu(x, a, x')$ and $\lambda(x, a, x')$
respectively; and $M(x) := \max_a Q(x, a)$ is a shorthand.

Substituting $\xi_m(x, a, x')$ in Equation (28) by its definition (see Section 5.2) and conditioning
on the Q-values, we obtain the conditional distribution of $\rho$:

$$P(\rho \mid m^{-\rho}, \mu, \lambda) = N(\bar{\rho}, 1/S) \tag{30}$$

where

$$\begin{aligned}
\bar{\rho} &= \frac{1}{S} \sum_{x, a, x'} \lambda(x, a, x') \left( \mu(x, a, x') - Q(x, a) + M(x') \right), \\
S &= \sum_{x, a, x'} \lambda(x, a, x').
\end{aligned}$$



The conditional distribution over the Q-values is more difficult to obtain, because each
$Q(x, a)$ enters the posterior distribution both linearly and non-linearly through $\mu$. However,
if we fix $Q(x, a)$ within the max operations, which amounts to treating each $M(x)$ as a
constant within a single Gibbs step, then the conditional distribution can be approximated
by

$$P(Q(x, a) \mid m^{-Q(x,a)}, \lambda, \mu) \approx N(\bar{Q}(x, a), 1/S(x, a)) \tag{31}$$

where

$$\begin{aligned}
\bar{Q}(x, a) &= \frac{1}{S(x, a)} \sum_{x'} \lambda(x, a, x') \left( \mu(x, a, x') - \rho + M(x') \right), \\
S(x, a) &= \sum_{x'} \lambda(x, a, x').
\end{aligned}$$

We expect this approximation to hold because the resulting update rule constitutes a
contraction operation that forms the basis of most stochastic approximation algorithms
(Mahadevan, 1996). As a result, the Gibbs sampler draws all the values from normal
distributions. In each cycle of the adaptive controller, one can carry out several Gibbs sweeps to
obtain a sample of $m$ to improve the mixing of the Markov chain. However, our experimental
results have shown that a single Gibbs sweep per state transition performs reasonably well.
Once a new parameter vector $m$ is drawn, the Bayesian control rule proceeds by taking the
optimal action given by Equation (18). Note that only the $\mu$ and $\lambda$ entries of the transitions
that have occurred need to be represented explicitly; similarly, only the Q-values of visited
states need to be represented explicitly.
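
One sweep of the approximate Gibbs sampler can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the dictionary layout (`Q` keyed by `(x, a)`, `mu` and `lam` keyed by observed transitions `(x, a, x_next)`) and all function names are hypothetical. The conditional mean of $\rho$ is written by analogy with the one for $Q(x, a)$, assuming $\xi_m(x, a, x') = Q(x, a) + \rho - M(x')$, and every state that appears as `x_next` is assumed to have at least one entry in `Q`.

```python
import random


def max_q(Q, x):
    """M(x) = max_a Q(x, a), evaluated with the current Q-values."""
    return max(q for (s, _), q in Q.items() if s == x)


def rho_conditional(Q, mu, lam):
    """Mean and precision of the normal conditional of rho given the Q-values."""
    S = sum(lam.values())
    mean = sum(l * (mu[k] - Q[(k[0], k[1])] + max_q(Q, k[2])) for k, l in lam.items()) / S
    return mean, S


def q_conditional(x, a, rho, Q, mu, lam):
    """Approximate conditional of Q(x, a), with every M(x') held fixed; None if unobserved."""
    keys = [k for k in lam if k[0] == x and k[1] == a]
    if not keys:
        return None
    S_xa = sum(lam[k] for k in keys)
    mean = sum(lam[k] * (mu[k] - rho + max_q(Q, k[2])) for k in keys) / S_xa
    return mean, S_xa


def gibbs_sweep(Q, mu, lam, rng):
    """Resample rho, then every Q(x, a) with observed transitions, from normal conditionals."""
    mean, prec = rho_conditional(Q, mu, lam)
    rho = rng.gauss(mean, prec ** -0.5)
    for (x, a) in list(Q):
        cond = q_conditional(x, a, rho, Q, mu, lam)
        if cond is not None:
            Q[(x, a)] = rng.gauss(cond[0], cond[1] ** -0.5)
    return rho, Q
```

Entries of `Q` without any observed transition are left untouched in this sketch, which mirrors the remark that only quantities tied to visited states and transitions need explicit representation.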